Authors
Alina Kramchaninova & Arne Defauw
Abstract
Deep learning models have significantly advanced the state of the art of question answering systems. However, the majority of datasets available for training such models have been annotated by humans, are open-domain, and are composed primarily in English. To deal with these limitations, we introduce a pipeline that creates synthetic data from natural text. To illustrate the domain-adaptability of our approach, as well as its multilingual potential, we use our pipeline to obtain synthetic data in English and Dutch. We combine the synthetic data with non-synthetic data (SQuAD 2.0) and fine-tune multilingual BERT models on the question answering task. Models trained with synthetically augmented data demonstrate a clear improvement in performance when evaluated on the domain-specific test set, compared to the models trained exclusively on SQuAD 2.0. We expect our work to be beneficial for training domain-specific question-answering systems when the amount of available data is limited.
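As a point of reference, the sketch below shows what the fine-tuning setup described in the abstract could look like: a multilingual BERT checkpoint fine-tuned for extractive QA on SQuAD 2.0 using the Hugging Face transformers and datasets libraries. The checkpoint name, hyperparameters, and preprocessing are illustrative assumptions rather than the authors' exact configuration, and the synthetic data the paper describes would simply be concatenated with the SQuAD 2.0 training split before training.

```python
# Minimal sketch (not the authors' code): fine-tune multilingual BERT
# for extractive QA on SQuAD 2.0-style data. Synthetic examples produced
# by a generation pipeline would be appended to the training split.
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

model_name = "bert-base-multilingual-cased"  # assumed mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

squad = load_dataset("squad_v2")  # SQuAD 2.0, includes unanswerable questions

def preprocess(examples):
    # Tokenize question/context pairs and map character-level answer
    # spans to token start/end positions.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        padding="max_length",
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answers = examples["answers"][i]
        if len(answers["answer_start"]) == 0:
            # Unanswerable question: point both labels at the [CLS] token.
            start_positions.append(0)
            end_positions.append(0)
            continue
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Locate the context span within the tokenized input.
        ctx_start = sequence_ids.index(1)
        ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            # Answer was truncated away; treat as unanswerable.
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = ctx_start
            while idx <= ctx_end and offsets[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = ctx_end
            while idx >= ctx_start and offsets[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")
    return tokenized

train_ds = squad["train"].map(
    preprocess, batched=True, remove_columns=squad["train"].column_names
)

# Illustrative hyperparameters, not the paper's reported settings.
args = TrainingArguments(
    output_dir="mbert-qa",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=2,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=default_data_collator,
)
trainer.train()
```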