
Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

No. of contract:

Type of project:
Duration: 01.06.2021 to 30.09.2023

This Action aims to improve machine translation output quality by extending the data sets and enhancing their quality, especially for specific under-resourced languages. The Action builds upon the previous CEF-funded Actions ParaCrawl and EuroPat, the H2020 project ‘GoURMET’ and the FP7 MSCA project ‘Abu-MaTran’.

Within the Action, new monolingual and parallel data will be acquired and enriched for the following under-resourced languages: Maltese, Slovenian, Croatian, Bulgarian, Turkish, Serbian, Montenegrin, Macedonian, Albanian and Icelandic. Text classification will be used to determine how well the parallel and monolingual data fit each of the ten DSI categories for which the ELRC repository contains data: e-Health, e-Justice, Online Dispute Resolution, Europeana, Open Data Portal, Business Registers Interconnection System, e-Procurement, Safer Internet, Cybersecurity, and EESSI.

As a result, the Action will extend the data in ELRC-SHARE, with a focus on DSI-specific data, to support the automated production and configuration of text translation engines tailored to the needs of online public services in specific domains. Finally, by enriching the data, the Action will contribute to the collection of language resources through ELRC-SHARE and so help improve the quality of the machine translation services offered by CEF AT.