Brno University of Technology (Department of Computer Graphics and Multimedia)
University of Ghent (Centre for Digital Humanities)
OCCAM stands for OCR, ClassificAtion and Machine Translation. The project develops software integrating image classification, translation memories, optical character recognition and machine translation to support the automated translation of scanned or handwritten documents. Two different use cases are investigated:
- Connecting the Business Registers Interconnection System of the European Commission with the EC’s eTranslation system. This makes large volumes of image-based business documents, which currently cannot be processed, accessible in multiple languages.
- In digital humanities, combining the digitisation and translation of historical texts allows significant growth and diversification of accessible corpora, helping to preserve cultural heritage and leading to the discovery of new knowledge and insights.
A demo version of the software was built, allowing users to upload a scanned document, process it, translate it using an external MT engine, and download the resulting transcription and translation. The software is available through an open-source repository, allowing the code to be deployed on a local server.
The technology can be applied wherever multilingual access to registers is required, and wherever researchers in universities, libraries, museums and archives need to process and access corpora in various languages and domains.
OCCAM: Cross-lingual Unlocking of Non-digital Texts (2021)
Authors: Laurens Meeus, Joachim Van den Bogaert, Arne Defauw, Oan Stultjens, Sara Szoc, Tom Vanallemeersch, Frederic Everaert & Koen Van Winckel
OCR, Classification & Machine Translation (OCCAM) (2020)
Authors: Joachim Van den Bogaert, Arne Defauw, Frederic Everaert, Koen Van Winckel, Alina Kramchaninova, Anna Bardadym, Tom Vanallemeersch, Pavel Smrž & Michal Hradiš