Evaluating machine translation output is no easy feat. To quantify the quality of machine translation, the first approach that comes to mind is a human evaluation: asking linguists or translators to offer their professional opinion on what makes a translation good or bad.
However, a human evaluation requires a lot of time and effort. Moreover, human evaluations are difficult to replicate as different evaluators will undoubtedly have varying opinions on the quality of a translation.
Therefore, many attempts have been made to create automated metrics that accurately estimate the quality of machine translation output. Given a test file, these metrics can automatically estimate the quality of a specific machine translation engine, or compare several engines to pick out the best one. In some cases the metrics provide a score for the translation of a single sentence, in others for the translation of a whole document.
Statistical evaluation scores in machine translation
Many automated metrics were developed in the early days of statistical machine translation (engines built using statistical models). These metrics calculate the similarity between machine translation output and a human reference translation (or multiple translations, if available). To name but a few:
- Edit distance. Edit distance is used as a metric for evaluating machine translation output by counting the edits (insertions, deletions or substitutions) required to turn the machine translation of a sentence into the reference (human) translation. The raw count is usually normalized to a score between 0 and 100: the lower the score (i.e. the fewer edits), the better the machine translation output quality.
The edit distance score is mostly used to indicate the post-editing effort required. The higher the score, the more changes are required from a post-editor, and therefore the longer the post-editing process takes. A minimal implementation is sketched after this list.
- TER (Translation Edit Rate). This score is a variant of edit distance that calculates the distance at word level rather than at character level, and also takes shifts of word groups into account. Again, the lower the TER score, the better the MT quality.
- BLEU (Bilingual Evaluation Understudy). This is the most frequently used metric for evaluating machine translation output. It measures the overlap between the word sequences (n-grams) in the machine translation output and those in the reference translation(s) for a document. A score between 0 and 30 indicates that the machine translation output carries a low level of similarity to the reference translations. A score over 30 means that the machine translation output is understandable, and a score over 50 that it is a good and fluent translation. In practice, even very good output rarely reaches a BLEU score above 80. An example after this list shows how BLEU and TER scores can be computed with an off-the-shelf library.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering). This metric was designed to improve on BLEU, mainly by producing more reliable sentence-level scores, and it supports stemming and synonym matching for a number of languages. This means that METEOR is better at detecting whether synonyms have been used in the translation; if that is the case, it does not penalize the output but adjusts the score accordingly. As with BLEU, the higher the METEOR score, the better (a sketch using an off-the-shelf implementation also follows this list).
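To make the edit distance idea concrete, here is a minimal character-level sketch in Python: a plain Levenshtein distance, normalized to a 0-100 scale. The example sentences are invented, and real tools typically use optimized implementations, but the logic is the same.

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn hyp into ref."""
    m, n = len(hyp), len(ref)
    prev = list(range(n + 1))          # distances from an empty hypothesis
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n]


def normalized_edit_distance(hyp: str, ref: str) -> float:
    """Express the edit count as a 0-100 score relative to the longer of the two strings."""
    if not hyp and not ref:
        return 0.0
    return 100.0 * edit_distance(hyp, ref) / max(len(hyp), len(ref))


print(round(normalized_edit_distance("The cat sat on mat",
                                     "The cat sat on the mat."), 1))  # lower is better
```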
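For BLEU and TER, a common choice is the open-source sacrebleu library. The sketch below assumes sacrebleu is installed (pip install sacrebleu) and uses invented hypothesis and reference sentences; in practice these would come from your test file.

```python
from sacrebleu.metrics import BLEU, TER

hypotheses = [
    "The economy grew by 2 percent last year.",
    "He said that the meeting was postponed.",
]
# One list per reference set, containing one reference per hypothesis.
references = [[
    "The economy grew by two percent last year.",
    "He said the meeting had been postponed.",
]]

bleu = BLEU().corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)

print(bleu)  # BLEU on a 0-100 scale: higher is better
print(ter)   # TER: lower is better
```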
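METEOR is also available off the shelf, for example in NLTK. The sketch below is a rough illustration assuming a recent NLTK version, which expects pre-tokenized input and needs the WordNet data for synonym matching; the sentences are invented.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # lexical database used for synonym matching

hypothesis = "The cab arrived quickly".split()
reference = "The taxi arrived quickly".split()

# Unlike exact n-gram matching, METEOR can align "cab" and "taxi" via WordNet,
# so the synonym is not treated as an error.
print(f"METEOR: {meteor_score([reference], hypothesis):.3f}")  # higher is better
```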
Neural scores
Newer machine translation evaluation metrics rely on neural models, which according to recent research correlate more closely with human judgements of quality. Some examples include:
- BERTScore. This metric makes use of a neural BERT model (Bidirectional Encoder Representations from Transformers), a type of model that carries a large amount of linguistic and world knowledge for one or more languages. Much like the statistical metrics, BERTScore computes a similarity between the machine translation output and the reference translation. However, it computes that similarity using contextual embeddings, which means it tries to capture the meaning of the output and compare it with the intended meaning (see the short example after this list).
- COMET (Crosslingual Optimized Metric for Evaluation of Translation). COMET calculates the similarity between a machine translation output and a reference translation using token or sentence embeddings. Unlike BERTScore, COMET is trained to predict human evaluations of MT quality, thereby aiming to give a more accurate insight. The result is a score that typically lies between 0 and 1, where higher is better (a sketch follows this list).
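A hedged sketch of BERTScore using the open-source bert-score package (pip install bert-score); the language code and sentences are illustrative assumptions, and the first run downloads a pretrained model.

```python
from bert_score import score

candidates = ["The committee approved the new budget on Friday."]
references = ["On Friday, the committee approved the new budget."]

# Returns precision, recall and F1 tensors with one value per sentence;
# F1 is the figure that is usually reported.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")  # closer to 1 is better
```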
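COMET can be computed with the open-source unbabel-comet package (pip install unbabel-comet). The sketch below assumes the publicly released Unbabel/wmt22-comet-da checkpoint as an example model; note that COMET also looks at the source sentence, not just the reference, and the segments shown are invented.

```python
from comet import download_model, load_from_checkpoint

# Downloads the model checkpoint on first use.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Vertrag wurde gestern unterzeichnet.",  # source sentence
    "mt":  "The contract was signed yesterday.",        # machine translation
    "ref": "The agreement was signed yesterday.",       # human reference
}]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # one score per segment, roughly between 0 and 1
print(output.system_score)  # average over all segments
```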
Benefits and downsides of automated metrics
The primary advantage of using automated metrics compared to a human evaluation is speed. Tools such as the Engine Advisory (part of the CrossLang MT Gateway) can generate scores in a matter of minutes.
The downside is that automated metrics are not able to identify exactly why one engine scores better than another. For example, one engine might score highly because it generates idiomatic translations, while another might do well thanks to its correct use of terminology. If you want to gain insight into this kind of specific information, a human evaluation is better suited.
Automated scores are particularly interesting for research and development purposes. They deliver quick results that are easy to reproduce. When building a custom machine translation engine, for example, automated metrics are ideal for tracking whether and to what extent the engine is improving.
As the use of machine translation grows every year, a reliable way to keep an eye on output quality is a valuable asset for translators, LSPs and anyone interested in adding machine translation to their translation workflow.