Authors
Arne Defauw, Tom Vanallemeersch, Sara Szoc, Frederic Everaert, Koen Van Winckel, Kim Scholte, Joris Brabers & Joachim Van den Bogaert
Abstract
This paper investigates the effectiveness of the ParaCrawl pipeline for collecting domain-specific training data for machine translation. We follow the different steps of the pipeline (document alignment, sentence alignment, cleaning) and add a topic-filtering component. Experiments are performed on the legal domain for the English to French and English to Irish language pairs. We evaluate the pipeline at both intrinsic (alignment quality) and extrinsic (MT performance) levels. Our results show that with this pipeline we obtain high-quality alignments and significant improvements in MT quality.