Evaluating machine translation output is no easy feat. To quantify the quality of machine translation, the first approach that comes to mind is a human evaluation: asking linguists or translators to offer their professional opinion on what makes a translation good or bad.
However, a human evaluation requires a lot of time and effort. Moreover, human evaluations are difficult to replicate as different evaluators will undoubtedly have varying opinions on the quality of a translation.
Therefore, many attempts have been made to create automated metrics that accurately estimate the quality of machine translation output. Given a test file, these metrics can automatically estimate the quality of a specific machine translation engine, or compare several engines to pick out the best one. In some cases the metrics provide a score for the translation of a single sentence, in others for the translation of a whole document.
Statistical evaluation scores in machine translation
Many automated metrics were developed in the early days of statistical machine translation (engines built using statistical models). These metrics calculate the similarity between machine translation output and a human reference translation (or multiple translations, if available). To name but a few:
- Edit distance. Edit distance is used as a metric for evaluating machine translation output by counting the edits (insertions, deletions or substitutions) required to turn the machine translation of a sentence into the reference (human) translation. The raw count is usually normalized to a score between 0 and 100: the lower the score (i.e. the fewer edits), the better the machine translation output quality.
The edit distance score is mostly used to indicate the post-editing effort required. The higher the score, the more changes are required from a post-editor, and therefore the longer the post-editing process takes. A minimal implementation is sketched after this list.
- TER (Translation Edit Rate). This score is a variant of edit distance that calculates the distance at word level rather than at character level, and also takes shifts of word groups into account. Again, the lower the TER score, the better the MT quality.
- BLEU (Bilingual Evaluation Understudy). This is the most frequently used metric for evaluating machine translation output. It measures the overlap between the word sequences (n-grams) in the machine translation output and those in the reference translation(s) for a document. A score between 0 and 30 indicates that the machine translation output carries a low level of similarity to the reference translations. A score over 30 means that the machine translation output is understandable, and a score over 50 that it is a good and fluent translation. In practice, even very good output rarely reaches a BLEU score above 80. An example after this list shows how BLEU and TER scores can be computed with an off-the-shelf library.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering). This metric was designed to improve on BLEU, mainly by producing more reliable sentence-level scores, and it supports stemming and synonym matching for a number of languages. This means that METEOR is better at detecting whether synonyms have been used in the translation; if that is the case, it does not penalize the output but adjusts the score accordingly. As with BLEU, the higher the METEOR score, the better (a sketch using an off-the-shelf implementation also follows this list).
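To make the edit distance idea concrete, here is a minimal character-level sketch in Python: a plain Levenshtein distance, normalized to a 0-100 scale. The example sentences are invented, and real tools typically use optimized implementations, but the logic is the same.

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn hyp into ref."""
    m, n = len(hyp), len(ref)
    prev = list(range(n + 1))          # distances from an empty hypothesis
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n]


def normalized_edit_distance(hyp: str, ref: str) -> float:
    """Express the edit count as a 0-100 score relative to the longer of the two strings."""
    if not hyp and not ref:
        return 0.0
    return 100.0 * edit_distance(hyp, ref) / max(len(hyp), len(ref))


print(round(normalized_edit_distance("The cat sat on mat",
                                     "The cat sat on the mat."), 1))  # lower is better
```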
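For BLEU and TER, a common choice is the open-source sacrebleu library. The sketch below assumes sacrebleu is installed (pip install sacrebleu) and uses invented hypothesis and reference sentences; in practice these would come from your test file.

```python
from sacrebleu.metrics import BLEU, TER

hypotheses = [
    "The economy grew by 2 percent last year.",
    "He said that the meeting was postponed.",
]
# One list per reference set, containing one reference per hypothesis.
references = [[
    "The economy grew by two percent last year.",
    "He said the meeting had been postponed.",
]]

bleu = BLEU().corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)

print(bleu)  # BLEU on a 0-100 scale: higher is better
print(ter)   # TER: lower is better
```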
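METEOR is also available off the shelf, for example in NLTK. The sketch below is a rough illustration assuming a recent NLTK version, which expects pre-tokenized input and needs the WordNet data for synonym matching; the sentences are invented.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # lexical database used for synonym matching

hypothesis = "The cab arrived quickly".split()
reference = "The taxi arrived quickly".split()

# Unlike exact n-gram matching, METEOR can align "cab" and "taxi" via WordNet,
# so the synonym is not treated as an error.
print(f"METEOR: {meteor_score([reference], hypothesis):.3f}")  # higher is better
```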
Neural scores
Newer machine translation evaluation metrics rely on neural models, which according to recent research correlate more closely with human judgements of quality. Some examples include:
- BERTScore. This metric makes use of a neural BERT model (Bidirectional Encoder Representations from Transformers), a type of model that carries a large amount of linguistic and world knowledge for one or more languages. Much like the statistical metrics, BERTScore computes a similarity between the machine translation output and the reference translation. However, it computes that similarity using contextual embeddings, which means it tries to capture the meaning of the output and compare it with the intended meaning (see the short example after this list).
- COMET (Crosslingual Optimized Metric for Evaluation of Translation). COMET calculates the similarity between a machine translation output and a reference translation using token or sentence embeddings. Unlike BERTScore, COMET is trained to predict human evaluations of MT quality, thereby aiming to give a more accurate insight. The result is a score that typically lies between 0 and 1, where higher is better (a sketch follows this list).
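A hedged sketch of BERTScore using the open-source bert-score package (pip install bert-score); the language code and sentences are illustrative assumptions, and the first run downloads a pretrained model.

```python
from bert_score import score

candidates = ["The committee approved the new budget on Friday."]
references = ["On Friday, the committee approved the new budget."]

# Returns precision, recall and F1 tensors with one value per sentence;
# F1 is the figure that is usually reported.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")  # closer to 1 is better
```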
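COMET can be computed with the open-source unbabel-comet package (pip install unbabel-comet). The sketch below assumes the publicly released Unbabel/wmt22-comet-da checkpoint as an example model; note that COMET also looks at the source sentence, not just the reference, and the segments shown are invented.

```python
from comet import download_model, load_from_checkpoint

# Downloads the model checkpoint on first use.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Vertrag wurde gestern unterzeichnet.",  # source sentence
    "mt":  "The contract was signed yesterday.",        # machine translation
    "ref": "The agreement was signed yesterday.",       # human reference
}]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # one score per segment, roughly between 0 and 1
print(output.system_score)  # average over all segments
```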
Benefits and downsides of automated metrics
The primary advantage of using automated metrics compared to a human evaluation is speed. Tools such as the Engine Advisory (part of the CrossLang MT Gateway) can generate scores in a matter of minutes.
The downside is that automated metrics are not able to identify exactly why one engine scores better than another. For example, one engine might score highly because it generates idiomatic translations, while another might do well thanks to its correct use of terminology. If you want to gain insight into this kind of specific information, a human evaluation is better suited.
Automated scores are particularly interesting for research and development purposes. They deliver quick results that are easy to reproduce. When building a custom machine translation engine, for example, automated metrics are ideal for tracking whether and to what extent the engine is improving.
As the use of machine translation grows every year, a reliable way to keep an eye on output quality is a valuable asset for translators, LSPs and anyone interested in adding machine translation to their translation workflow.