Proprietary Tools Developed by the R&D Department in Ongoing Research
Researching and developing software that streamlines or expedites production is one of the R&D department's main tasks. Sometimes, however, in the course of ongoing research, our department has to create solutions for internal use. As machine translation (MT) has recently become our most active area of research, we would like to share some insights into the tools that help us conduct the relevant studies.
Over the course of our MT research, we have written many short scripts that automate parts of our internal work. In most cases, these were scripts for generating certain reports, scripts for mass translation of segments through machine translation providers' REST APIs (whenever the segments could not be translated by the provider through CAT tools), scripts for format conversion of analytical data, scripts for working with neural networks, and so on. All of these applications draw on our proficiency in various programming languages and our exploration of third-party libraries.
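To give a sense of the mass-translation scripts, here is a minimal sketch of the batch pattern in Python. The endpoint, payload shape, and credential below are hypothetical placeholders; every real provider defines its own REST contract.

import requests

# Hypothetical endpoint and credential; each MT provider defines its own.
API_URL = "https://mt.example.com/v1/translate"
API_KEY = "..."

def translate_batch(segments, source_lang, target_lang):
    """Send a list of segments to the provider and return the translations."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": segments, "source": source_lang, "target": target_lang},
        timeout=60,
    )
    response.raise_for_status()
    return [item["translation"] for item in response.json()["results"]]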
Since it is impossible to cover all of these proprietary tools in a short text, we will focus on the larger developments and the tools we use most extensively. We have selected two: one that automates the evaluation of translations across diverse metrics, and another that improves the quality of translation memories used for further model training.
Whenever a reference translation is available (whether a human translation or an edited machine translation), we can compare it with the raw MT output and evaluate the quality of the machine translation. This is necessary to determine which machine translation engine will provide the highest quality for a given client, language pair, and subject matter. For this purpose, we use various metrics that we have tested and implemented.
There is, in fact, a significant number of computational methods for evaluating translations. We currently employ metrics such as hLEPOR, nLEPOR, BLEU, NIST, RIBES, METEOR, ROUGE-L, TER, and chrF. We can select several metrics for an evaluation or apply all of them at once, depending on the urgency of the request, the language pair, and the domain. The more research and data we have accumulated for a given client, the easier the evaluation becomes.
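As a small illustration of how several of these metrics can be computed, the following sketch uses the open-source sacrebleu library, which implements BLEU, chrF, and TER among others; the remaining metrics we use come from separate third-party implementations.

from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["The cat sat on the mat."]           # MT output
references = [["The cat is sitting on the mat."]]  # one reference stream

# Each metric exposes the same corpus_score() interface.
for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))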
The problem, however, is that these metrics were developed by different people and companies, resulting in divergent data formats for both input and output. Moreover, implementing the algorithms behind these metrics ourselves would be extremely costly and complicated, and the ready-made implementations are written in different programming languages. With this in mind, we have developed a tool that consolidates, as far as possible, all of these third-party libraries into a single utility tailored to the data format that suits us best.
It is called MTScore, and it analyzes the quality of translations according to several of the metrics described above. This desktop tool works with a simple spreadsheet file containing three columns of segments, for which the scores are calculated. The output is a ready-to-use report that we can use either to communicate translation quality to our colleagues in production or to gauge the effectiveness of our model training.
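The following sketch illustrates this spreadsheet-in, report-out flow. The file names and column order (source, reference, MT output) are assumptions made for the example; MTScore's actual internal format differs, and only the chrF score is computed here.

import pandas as pd
from sacrebleu.metrics import CHRF

# Assumed column order: source, reference translation, MT output.
df = pd.read_excel("segments.xlsx")
df.columns = ["source", "reference", "mt"]

chrf = CHRF()
df["chrf"] = [
    chrf.sentence_score(row.mt, [row.reference]).score
    for row in df.itertuples()
]

df.to_excel("report.xlsx", index=False)  # per-segment score report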
While working on MTScore, we ran into a problem: many of the libraries depend on machine learning frameworks that we cannot simply bundle into the Windows-based desktop version of the tool. That is why we are currently reconfiguring this utility as a new Linux-based server version. This will substantially expand the range of metrics we can use without relying on third-party services, and we expect it to have a positive impact on the quality of our research.
The second tool is the Janus TMX Tool. It is also a desktop tool, and it helps us prepare a translation memory (TMX) for further training of MT models. A translation memory may contain segments consisting of multiple sentences, which must be further segmented into individual sentences to properly train the engines. A clear one-to-one link between the source text and the translated text is preferable for training, so this utility can perform such segmentation in several ways.
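For readers unfamiliar with the format, a TMX file is XML in which each translation unit (tu) holds per-language variants (tuv) with the segment text in a seg element. A minimal sketch of extracting bilingual pairs, ignoring any inline markup inside segments, might look like this:

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_pairs(tmx_path, src_lang, tgt_lang):
    """Collect (source, target) segment pairs from a TMX file."""
    pairs = []
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        # Map each language code to the text of its <seg> element.
        segs = {tuv.get(XML_LANG): tuv.findtext("seg")
                for tuv in tu.iter("tuv")}
        if segs.get(src_lang) and segs.get(tgt_lang):
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs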
Additionally, the program can subtract from one TMX the segments that match segments in another TMX, quickly yielding, from the first TMX, a set of segments for a test translation. This improves the evaluation of a model after training, since these segments serve as a test dataset for a specific client, language pair, and domain.
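Under the assumptions of the read_pairs helper sketched above, and with placeholder file names and language codes, this subtraction step reduces to a set difference over segment pairs:

# Keep only the pairs from the first memory that do not occur in the second.
all_pairs = read_pairs("full_memory.tmx", "en", "de")
known_pairs = set(read_pairs("training_memory.tmx", "en", "de"))

test_pairs = [pair for pair in all_pairs if pair not in known_pairs]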
Janus TMX can also search a translation memory against a set of regular expressions and generate a report of the matches, in order to quickly identify broken segments, segments with the wrong writing direction or the wrong language, and other data that needs to be removed before training a new model on that translation memory. This improves the basic cleaning of the translation memory: the larger the memory, the higher the likelihood of encountering segments containing data that is unsuitable for training purposes.
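A minimal sketch of such a sweep follows; the three patterns are illustrative stand-ins for the much longer production list:

import re

# Illustrative patterns only; each one flags a class of unsuitable data.
PATTERNS = {
    "leftover_markup": re.compile(r"</?\w+[^>]*>"),
    "rtl_control_char": re.compile("[\u200f\u202b\u202e]"),
    "cyrillic_in_latin_text": re.compile("[\u0400-\u04ff]"),
}

def scan(pairs):
    """Report (segment index, pattern name) for every match in a pair list."""
    report = []
    for i, (src, tgt) in enumerate(pairs):
        for name, rx in PATTERNS.items():
            if rx.search(src) or rx.search(tgt):
                report.append((i, name))
    return report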
However, many difficulties remain, since segmentation is an algorithmically difficult task. Essentially, the program divides a segment at certain characters, which may appear at different positions in the source text and in the translation. And while human assistance is still necessary, this proprietary tool greatly facilitates our research in the field of machine translation and the training of customized models.
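The core difficulty can be seen in a simplified sketch of the character-based split: a pair is divided only when both sides yield the same number of sentences, preserving the one-to-one link needed for training; otherwise it is left intact for human review.

import re

# Split at sentence-final punctuation followed by whitespace (simplified).
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_pair(source, target):
    """Split a segment pair sentence by sentence, if the counts agree."""
    src_sents = SENT_BOUNDARY.split(source)
    tgt_sents = SENT_BOUNDARY.split(target)
    if len(src_sents) == len(tgt_sents):
        return list(zip(src_sents, tgt_sents))
    return [(source, target)]  # counts differ: keep the pair intact

print(split_pair("First sentence. Second one.", "Erster Satz. Zweiter Satz."))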
In conclusion, we would like to highlight that our team is constantly working on new MT solutions, and these innovations give our company a competitive advantage. Janus Worldwide has almost completely moved away from third-party solutions, and we are on track to complete this transition, a direction that many of our clients appreciate.