Authors
Arne Defauw, Sara Szoc, Anna Bardadym, Joris Brabers, Frederic Everaert, Roko Mijic, Kim Scholte, Tom Vanallemeersch, Koen Van Winckel & Joachim Van den Bogaert
Abstract
To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel
sentences are needed. Typically, large amounts of data are scraped from multilingual web sites
and aligned into datasets for training. Many tools exist for automatic alignment of such datasets.
However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present
a tool for automatic misalignment detection (MAD). We treated the task of determining whether
a pair of aligned sentences constitutes a genuine translation as a supervised regression problem.
We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm
used shallow features and features obtained after an initial translation step. We showed that
the Levenshtein distance between the target and the translated source and the cosine distance
between sentence embeddings of the source and the target were the two most important features
for misalignment detection. Using gold standards for alignment, we demonstrated that our
model can increase the quality of alignments in a corpus substantially, reaching a precision close to
100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
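
The two features highlighted above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether the Levenshtein distance is computed over characters or tokens, whether it is normalized, or which sentence-embedding model is used. The sketch assumes a character-level, unnormalized edit distance and treats embeddings as plain lists of floats supplied by some external model; the function names are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), each cost 1.

    Assumption: character-level and unnormalized; the paper may differ.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two equal-length vectors.

    Assumption: the vectors stand in for sentence embeddings of the
    source and target produced by an unspecified embedding model.
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

In a pipeline of this kind, `levenshtein` would be applied to the target sentence and the machine translation of the source, and `cosine_distance` to the embeddings of the source and the target; both values would then be passed, along with the shallow features, to the supervised model.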