Open position at JetBrains
Paid Internship - Duplicates detection
- Na Hřebenech II 785/9, 147 00 Praha 4-Podolí, Česko
The project team is researching approaches to detecting explicit and potential duplicates in natural-language text written in the IDE (starting with English).
Our IntelliJ plugin helps streamline the process of writing technical documentation for a software application inside the IDE. It supports the concept of ‘single source’, which means that a chunk of content can be written once and re-used in multiple help articles or documentation outputs by including it by ID.
The proposed inspection should be able to detect identical pieces of content and, more importantly, non-explicit duplicates, so that they can be extracted to a library and reused. Such an inspection should help:
- maintain consistency throughout sources
- avoid making multiple updates when a UI changes
- reduce the review and editing effort
- reduce localization costs
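To make the "identical pieces of content" case concrete, here is a minimal sketch in Java. It is not the plugin's actual code. The class name `ExactDuplicates`, the method names, and the chunk IDs are hypothetical. The normalization rule (collapse whitespace, ignore case) is also an assumption about what should count as identical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ExactDuplicates {
    // Assumption: copies that differ only in whitespace or letter case count as identical.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Group chunk IDs by normalized text and keep only groups with two or more members.
    static List<List<String>> findIdentical(Map<String, String> chunksById) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : chunksById.entrySet()) {
            groups.computeIfAbsent(normalize(entry.getValue()), k -> new ArrayList<>())
                  .add(entry.getKey());
        }
        List<List<String>> duplicates = new ArrayList<>();
        for (List<String> ids : groups.values()) {
            if (ids.size() > 1) {
                duplicates.add(ids);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hypothetical chunk IDs and texts, for illustration only.
        Map<String, String> chunks = new LinkedHashMap<>();
        chunks.put("install-a", "Click  OK to apply the changes.");
        chunks.put("install-b", "click OK to apply the changes.");
        chunks.put("intro", "This guide describes the setup.");
        System.out.println(findIdentical(chunks));
    }
}
```

Grouping by a normalized key takes a single pass over the chunks, so exact duplicates are the cheap part. Non-explicit duplicates need a similarity measure instead of equality.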
Comparing each chunk of text with every other chunk and suggesting duplicates based on the percentage of matches is not something that can run in the IDE at runtime on a large code base. We therefore expect you to research, try, and test different approaches, which may include ML, Elasticsearch, trigram search, the Apache Lucene engine, and any other approach you can apply.
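As one illustration of the trigram idea, the sketch below builds an inverted index from character trigrams to chunk IDs. A query is then scored only against chunks that share at least one trigram with it, not against every chunk. This is a minimal sketch under stated assumptions, not the team's chosen design. The class `TrigramIndex`, its methods, the Jaccard measure, and the 0.5 threshold are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TrigramIndex {
    private final Map<String, Set<String>> trigramsById = new HashMap<>();
    private final Map<String, Set<String>> idsByTrigram = new HashMap<>();

    // Character trigrams of the lower-cased, whitespace-collapsed text.
    static Set<String> trigrams(String text) {
        String s = text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            result.add(s.substring(i, i + 3));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Register a chunk under every trigram it contains.
    void add(String id, String text) {
        Set<String> grams = trigrams(text);
        trigramsById.put(id, grams);
        for (String gram : grams) {
            idsByTrigram.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    // Score only the chunks that share at least one trigram with the query text.
    List<String> similarTo(String text, double threshold) {
        Set<String> grams = trigrams(text);
        Set<String> candidates = new HashSet<>();
        for (String gram : grams) {
            candidates.addAll(idsByTrigram.getOrDefault(gram, Set.of()));
        }
        List<String> matches = new ArrayList<>();
        for (String id : candidates) {
            if (jaccard(grams, trigramsById.get(id)) >= threshold) {
                matches.add(id);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical chunks; the threshold of 0.5 is an arbitrary value for the demo.
        TrigramIndex index = new TrigramIndex();
        index.add("apply-changes", "Click OK to apply the changes.");
        index.add("open-settings", "Open the Settings dialog.");
        System.out.println(index.similarTo("Click OK to apply your changes.", 0.5));
    }
}
```

Common trigrams produce long candidate lists, so a real implementation would likely need to prune them, for example by requiring a minimum number of shared trigrams before scoring. Lucene and Elasticsearch provide n-gram tokenization and inverted indexes out of the box, which is one reason they appear in the list of options.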
Requirements:
- Java/Kotlin knowledge
- Basic knowledge of natural language processing
- English (pre-intermediate and above)
Expected results:
- An inspection that can be run in the IDE, or in headless mode behind an external web interface, and that provides data on potential duplicates
- An intention action in the IDE that would suggest extracting such duplicates to reusable chunks
- An inspection that would analyze duplicates in the background and suggest replacing content with an existing chunk as you type (ideally)