Authors
Julia Ive, Lucia Specia, Sara Szoc, Tom Vanallemeersch, Joachim Van den Bogaert, Eduardo Farah, Christine Maroti, Artur Ventura & Maxim Khalilov
Abstract
We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine
translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled
content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether,
the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation
system, a post-edited version of that translation by a professional translator, and – where available – the original reference translation
crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and
benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting finding of our
post-editing analysis is that neural systems built with publicly available general-domain data can provide high-quality translations,
even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate for testing
evaluation metrics. The data is freely available as an ELRC-SHARE resource.
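
As a rough illustration of the tuple structure described above, the sketch below models one entry and computes a simplified word-level edit rate between two strings. The class name, field names, and helper functions are assumptions made for this example and do not reflect the released file format. The edit rate is a plain Levenshtein distance over whitespace tokens, not a full TER implementation with block shifts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One dataset tuple. Field names are illustrative, not the released schema."""
    source: str
    mt: str
    post_edit: str
    reference: Optional[str] = None  # present only where a crawled reference exists


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_rate(hyp: str, ref: str) -> float:
    """Edits per reference word: a simplified edit rate without block shifts."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else 1.0
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


# Invented example segment, not taken from the dataset.
seg = Segment(
    source="The application must be submitted in writing.",
    mt="De aanvraag moet schriftelijk worden ingediend.",
    post_edit="De aanvraag moet schriftelijk worden ingediend.",
    reference="Aanvragen dienen schriftelijk te worden ingediend.",
)
mt_vs_pe = edit_rate(seg.mt, seg.post_edit)
mt_vs_ref = edit_rate(seg.mt, seg.reference) if seg.reference else None
```

In this invented segment the post-editor left the machine translation unchanged, while the independent reference is worded differently, so the machine translation scores far better against the post-edit than against the reference. This is the kind of gap between the two comparisons that the abstract points to.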