Arne Defauw, Tom Vanallemeersch, Sara Szoc, Frederic Everaert, Koen Van Winckel, Kim Scholte, Joris Brabers & Joachim Van den Bogaert
This paper investigates the effectiveness of the ParaCrawl pipeline for collecting domain-specific training data for machine translation. We follow the different steps of the pipeline (document alignment, sentence alignment, cleaning) and add a topic-filtering component. Experiments are performed on the legal domain for the English to French and English to Irish language pairs. We evaluate the pipeline at both intrinsic (alignment quality) and extrinsic (MT performance) levels. Our results show that with this pipeline we obtain high- quality alignments and significant improvements in MT quality.
To read the full article please fill out the form below:
"*" indicates required fields