Research Areas | Language technologies and digital humanities


In these areas, we are addressing the fields of natural language processing and understanding, text and network analytics, open access language resources, and digital humanities.

In the field of natural language processing and understanding, we address problems in the news media industry concerning the analysis of news and comments, especially by leveraging cross-lingual embeddings coupled with deep neural networks, which allow existing monolingual resources to be used across languages. We are working on machine learning methods for NLP. We are developing explanation techniques for neural classifiers by extending SHAP explanations and by analyzing self-attention. We are developing an autoML approach, autoBOT, in which an evolutionary algorithm jointly optimizes various sparse and dense representations for a given text classification task, and we have applied subgroup discovery methods to understand news sentiment. We are also developing new methods for the fundamental NLP task of semantic parsing, using two approaches: one based on incremental parsing with vector-space models, designed to be suitable for dialogue processing, and one based on large pre-trained neural models with simplified intermediate representations, which achieved new state-of-the-art results for parsing natural language text into SQL queries for database search. Finally, we are developing methods for understanding the structure of dialogue and interaction in large groups, based on neural NLP and social network analysis, to understand the nature of explanation, decision-making, and influence in large organizations.

In the field of text and network analytics, our research approach is to combine methods of text mining, natural language processing, network analysis, and topic detection to reveal and highlight underlying characteristics in different domains. The main sources of data that we analyze are social media (Twitter, Facebook, YouTube). We are developing models for automated hate speech detection and tracking. We are also comparing various methods for extracting forward-looking sentences from annual reports, and we have contributed to the FinSim-2 task on financial concept classification. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We are using cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between languages. Our experiments show that transferring models between similar languages is sensible, while dataset expansion did not increase predictive performance. Similarly, we can address the task of offensive language detection in zero-shot and few-shot settings, where no or only a few training examples in the target language are available. Finally, the transfer learning approach is applied to the task of diachronic semantic change detection and to exploring scientific discourse on the topic of ecosystem services.
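The idea of zero-shot cross-lingual transfer can be illustrated with a minimal sketch. It assumes a shared embedding space in which translation pairs lie close together. The vectors, the word list, and the nearest-centroid classifier below are invented for illustration and are not the models used in the work described above.

```python
# Toy shared (cross-lingual) embedding space. All vectors are made up for
# illustration; real cross-lingual embeddings have hundreds of dimensions.
EMBEDDINGS = {
    # English
    "good": (0.9, 0.1), "great": (0.8, 0.2),
    "bad": (-0.9, 0.1), "awful": (-0.8, -0.1),
    # Slovenian (assumed to be aligned with the English vectors)
    "dobro": (0.85, 0.15), "odlično": (0.8, 0.1),
    "slabo": (-0.85, 0.05), "grozno": (-0.8, -0.2),
}


def embed(text):
    """Average the vectors of known words; unknown words are ignored."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def train_centroids(examples):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for text, label in examples:
        vec = embed(text)
        prev = sums.get(label, (0.0,) * len(vec))
        sums[label] = tuple(a + b for a, b in zip(prev, vec))
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(x / counts[lab] for x in s) for lab, s in sums.items()}


def predict(centroids, text):
    """Return the label whose centroid is closest to the text embedding."""
    vec = embed(text)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


# Train on English only, then classify Slovenian text (zero-shot transfer).
english_train = [
    ("good great", "positive"), ("great", "positive"),
    ("bad awful", "negative"), ("awful", "negative"),
]
centroids = train_centroids(english_train)
slovenian_predictions = [predict(centroids, t) for t in ["dobro odlično", "grozno slabo"]]
print(slovenian_predictions)
```

Because the classifier only ever sees vectors, it is indifferent to the language of the input; how well this works in practice depends on how well the two languages are aligned in the shared space.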

In the field of open access language resources, we are leading CLARIN.SI, the Slovenian national node of the European CLARIN ERIC research infrastructure, which provides easy publication and sustainable access to digital language data for scholars in the humanities and social sciences and in other disciplines that use or produce language resources. CLARIN.SI maintains the CTS-certified CLARIN.SI repository, concordancers, and other Web services, and supports the creation of language resources and the promotion of digital linguistics.

We are also active in the field of digital humanities.

Projects in the field of language technologies and digital humanities:

BI-FR/23-24-PROTEUS-006

Cross-lingual and cross-domain methods for terminology extraction and alignment, 1.1.2023 - 31.12.2024, Senja Pollak

BI-US/22-24-170

Working Memory based assessment of Large Language Models, 01.07.2022 - 30.06.2024, Senja Pollak

Development of natural language processing prototype for sentiment analysis, keyword detection and clustering of news articles

ParlaMINT II

Towards Comparable Parliamentary Corpora, 01.12.2021 - 31.05.2023, Tomaž Erjavec, Nikola Ljubešić

RobaCOFI

Robust and adaptable comment filtering, 01.03.2022-28.02.2023, Senja Pollak, Matthew Purver

P2-0103

Knowledge technologies, 1.1.2022 - 31.12.2027, Sašo Džeroski

J5-3102

Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration, 2021-2024, Senja Pollak

J6-3131

Formant combinatorics in Slovenian, 1.10.2021-30.9.2024, Senja Pollak, Tomaž Erjavec

MaCoCu

Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, 2021-2031, Nikola Ljubešić

J6-2579

Tradition and Innovation: Traditional Paremiological Units in Dialogue with Contemporary Use, 1.9.2020-31.8.2023, Tomaž Erjavec