Research Areas ǀ Language technologies and digital humanities

In these areas, we are addressing the fields of natural language processing and understanding, text and network analytics, open access language resources, and digital humanities.

In the field of natural language processing and understanding, we address problems faced by the news media industry in analyzing news and comments. In particular, we leverage cross-lingual embeddings coupled with deep neural networks, which allow existing monolingual resources to be used across languages. We are working on machine learning methods for NLP. We are developing explanation techniques for neural classifiers by extending SHAP explanations or by analyzing self-attention. We are developing an autoML approach, autoBOT, which uses an evolutionary algorithm to jointly optimize various sparse and dense representations for a given text classification task, and we have applied subgroup discovery methods to understand news sentiment. We are also developing new methods for the fundamental NLP task of semantic parsing, using two approaches: one based on incremental parsing with vector-space models, designed to be suitable for dialogue processing, and one based on large pre-trained neural models with simplified intermediate representations, which achieved new state-of-the-art results in parsing natural language text into SQL queries for database search. Finally, we are developing methods for understanding the structure of dialogue and interaction in large groups, based on neural NLP and social network analysis, in order to understand the nature of explanation, decision-making, and influence in large organizations.

In the field of text and network analytics, we combine methods of text mining, natural language processing, network analysis, and topic detection to reveal and highlight underlying characteristics in different domains. The main sources of data that we analyze are social media (Twitter, Facebook, YouTube). We are developing models for automated hate speech detection and tracking. We are also comparing various methods for extracting forward-looking sentences from annual reports, and we contributed to the FinSim-2 task on financial concept classification. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We are using cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between languages. Our experiments show that transferring models between similar languages is sensible, whereas dataset expansion did not increase predictive performance. Similarly, we can address the task of offensive language detection in zero-shot and few-shot settings, where no or only a few training examples in the target language are available. Finally, the transfer learning approach is applied to the task of diachronic semantic change detection and to the exploration of scientific discourse on the topic of ecosystem services.
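
The idea of zero-shot cross-lingual transfer can be illustrated with a minimal sketch. It is not the group's actual pipeline: the two-dimensional vectors, the tiny vocabulary, and the nearest-centroid classifier are all hypothetical stand-ins chosen for brevity, and the function names (`embed`, `train_centroids`, `predict`) are invented for this example. The only point is that a model trained on labelled examples in one language can be applied to another language when words from both share one embedding space.

```python
from statistics import fmean

# Hypothetical toy cross-lingual embeddings: English and Slovenian words
# live in one shared 2-d space (real aligned embeddings are far larger).
EMBEDDINGS = {
    "good": (0.9, 0.1), "great": (0.8, 0.2),
    "bad": (-0.9, 0.1), "awful": (-0.8, -0.1),
    "dobro": (0.85, 0.15), "odlično": (0.8, 0.1),
    "slabo": (-0.85, 0.0), "grozno": (-0.8, -0.2),
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(fmean(dim) for dim in zip(*vectors))


def train_centroids(examples):
    """Compute one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: tuple(fmean(dim) for dim in zip(*vecs))
        for label, vecs in by_label.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest to the text vector."""
    v = embed(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(v, centroids[label])),
    )


# Train on English only (the source language).
english_train = [("good great", "positive"), ("bad awful", "negative")]
centroids = train_centroids(english_train)

# Zero-shot: classify Slovenian text without any Slovenian training data.
slovenian_label = predict(centroids, "dobro odlično")
```

In a few-shot setting, the same sketch would simply add a handful of labelled target-language examples to the training list before computing the centroids.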

In the field of open access language resources, we are leading CLARIN.SI, the Slovenian national node of the European CLARIN ERIC research infrastructure, which provides easy publication and sustainable access to digital language data for scholars in the humanities and social sciences and in other disciplines that use or produce language resources. CLARIN.SI maintains the CTS-certified CLARIN.SI repository, concordancers, and other Web services, and supports the creation of language resources and the promotion of digital linguistics.

We are also active in the field of digital humanities.

Projects in the field of language technologies and digital humanities:

J6-2581

Computer-assisted multilingual news discourse analysis with contextual embeddings, 1.9.2020-31.12.2023, Senja Pollak

J5-2554

Quantitative and qualitative analysis of the unregulated corporate financial reporting, 1.9.2020-31.8.2023, Senja Pollak, Martin Žnidaršič

RSDO

Development of Slovene in a Digital Environment (DSDE), 1.5.2020-28.2.2023, Tomaž Erjavec

IMSyPP

Innovative Monitoring Systems and Prevention Policies of Online Hate Speech, 1.3.2020-31.5.2022, Petra Kralj Novak, Igor Mozetič

RI-SI CLARIN

Development of Research Infrastructure for the International Competitiveness of Slovenian RRI Space, 22.7.2019-31.8.2021, Tomaž Erjavec

N6-0099

The linguistic landscape of hate speech on social media, 1.3.2019-28.2.2023, Tomaž Erjavec

EMBEDDIA

Cross-Lingual Embeddings for Less-Represented Languages in European News Media, 1.1.2019-31.12.2021, Senja Pollak, Nada Lavrač

J6-9372

Terminology and Knowledge Frames across Languages, 1.7.2018-30.6.2021, Senja Pollak

Distant reading for European Literary History

1.1.2018-31.12.2021, Tomaž Erjavec

J6-8255

Collocations as a Resource for Language Description: Semantic and Temporal Aspects, 1.5.2017-30.4.2020, Nikola Ljubešić