Research Areas ǀ Language technologies and digital humanities

In these areas, we are addressing the fields of natural language processing and understanding, text and network analytics, open access language resources, and digital humanities.

In the field of natural language processing and understanding, we address problems faced by the news media industry in the analysis of news and comments, in particular by leveraging cross-lingual embeddings coupled with deep neural networks, which allow existing monolingual resources to be used across languages. We are working on machine learning methods for NLP. We are developing explanation techniques for neural classifiers, either by extending SHAP explanations or by analyzing self-attention. We are developing an autoML approach, autoBOT, in which we use an evolutionary algorithm to jointly optimize various sparse and dense representations for a given text classification task, and we have applied subgroup discovery methods to understand news sentiment. We are also developing new methods for the fundamental NLP task of semantic parsing, using two approaches: one based on incremental parsing using vector-space models, designed to be suitable for dialogue processing, and one based on large pre-trained neural models using simplified intermediate representations, which achieved new state-of-the-art results for parsing natural language text to SQL queries for database search. Finally, we are developing methods for understanding the structure of dialogue and interaction in large groups, based on neural NLP and social network analysis, to understand the nature of explanation, decision-making, and influence in large organizations.

In the field of text and network analytics, our research approach is to combine methods of text mining, natural language processing, network analysis, and topic detection to reveal and highlight underlying characteristics in different domains. The main sources of data that we analyze are social media (Twitter, Facebook, YouTube). We are developing models for automated hate speech detection and tracking. We are also comparing various methods for forward-looking sentence extraction from annual reports, and we have contributed to the FinSim-2 task on financial concept classification. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We are using cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between languages. Our experiments show that the transfer of models between similar languages is sensible, while dataset expansion did not increase the predictive performance. Similarly, we can address the task of offensive language detection in zero-shot and few-shot settings, where no or only a few training examples in the target language are available. Finally, the transfer learning approach is applied to the task of diachronic semantic change detection and to the exploration of scientific discourse on the topic of ecosystem services.

In the field of open access language resources, we are leading CLARIN.SI, the Slovenian national node of the European CLARIN ERIC research infrastructure, which provides easy publication and sustainable access to digital language data for scholars in the humanities and social sciences and other disciplines that use or produce language resources. CLARIN.SI maintains the CTS-certified CLARIN.SI repository, concordancers, and other Web services, and supports the creation of language resources and the promotion of digital linguistics.

We are also active in the field of digital humanities.

Projects in the field of language technologies and digital humanities:

GC-0002

Large Language Models for Digital Humanities (LLM4DH), 1. 10. 2024 - 30. 9. 2027, Senja Pollak

ELEXIS

European lexicographic infrastructure, 1.12.2019 - 31.12.2020, Tomaž Erjavec

J5-50169

Linguistic Accessibility of Social Assistance Rights in Slovenia, 1.10.2023 - 30.9.2026, Senja Pollak

J7-4642

Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language, 1.10.2022 - 30.9.2025, Nikola Ljubešić

BI-FR/23-24-PROTEUS-006

Cross-lingual and cross-domain methods for terminology extraction and alignment, 1. 1. 2023-31. 12. 2024, Senja Pollak

BI-US/22-24-170

Working Memory based assessment of Large Language Models, 01.07.2022 - 30.06.2024, Senja Pollak

Development of natural language processing prototype for sentiment analysis, keyword detection and clustering of news articles

ParlaMINT II

Towards Comparable Parliamentary Corpora, 01.12.2021 - 31.05.2023, Tomaž Erjavec, Nikola Ljubešić

RobaCOFI

Robust and adaptable comment filtering, 01.03.2022-28.02.2023, Senja Pollak, Matthew Purver

P2-0103

Knowledge technologies, 1.1.2022 - 31.12.2027, Sašo Džeroski

J5-3102

Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration, 2021-2024, Senja Pollak

J6-3131

Formant combinatorics in Slovenian, 1.10.2021-30.9.2024, Senja Pollak, Tomaž Erjavec

MaCoCu

Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, 2021-2031, Nikola Ljubešić

J6-2579

Tradition and Innovation: Traditional Paremiological Units in Dialogue with Contemporary Use, 1.9.2020-31.8.2023, Tomaž Erjavec

J6-2581

Computer-assisted multilingual news discourse analysis with contextual embeddings, 1.9.2020-31.12.2023, Senja Pollak

J5-2554

Quantitative and qualitative analysis of the unregulated corporate financial reporting, 1.9.2020-31.8.2023, Senja Pollak, Martin Žnidaršič

RSDO

Development of Slovene in a Digital Environment (DSDE), 1.5.2020-28.2.2023, Tomaž Erjavec

IMSyPP

Innovative Monitoring Systems and Prevention Policies of Online Hate Speech, 1.3.2020-31.5.2022, Petra Kralj Novak, Igor Mozetič

RI-SI CLARIN

Development of research infrastructure for the international competitiveness of Slovenian RRI space, 22.7.2019-31.08.2021, Tomaž Erjavec

N6-0099

The linguistic landscape of hate speech on social media, 1.3.2019-28.2.2023, Tomaž Erjavec

EMBEDDIA

Cross-Lingual Embeddings for Less-Represented Languages in European News Media, 1.1.2019-31.12.2021, Senja Pollak, Nada Lavrač

J6-9372

Terminology and Knowledge Frames across Languages, 1.7.2018-30.6.2021, Senja Pollak

Distant reading for European Literary History

1.1.2018-31.12.2021, Tomaž Erjavec

J6-8255

Collocations as a Resource for Language Description: Semantic and Temporal Aspects, 1.5.2017-30.4.2020, Nikola Ljubešić

J7-8280

Resources, methods and tools for the understanding, identification and classification of various forms of socially unacceptable discourse in the information society, 1.5.2017-30.4.2020, Tomaž Erjavec

TermIOLAR2

Prototype program solution for extraction and alignment of terminology from parallel corpora of translation memories, 1.3.2017-15.8.2017, Senja Pollak

ReLDI

Regional Linguistic Data Initiative, 1.4.2016-31.12.2017, Nikola Ljubešić, Tomaž Erjavec

L6-7134

Forbidden Books in the Slovenian Lands in the Early Modern Period, 1.1.2016-31.12.2018, Tomaž Erjavec

J6-7094

Slovene scientific texts: resources and description, 1.1.2016-31.12.2018, Tomaž Erjavec

TermIOLAR1

Development of a prototype program solution for support of semi-automatic extraction and management of terminology in monolingual and multilingual corpora, 13.10.2015-30.6.2017, Senja Pollak

P2-0103

Knowledge technologies, 1.1.2015-28.2.2021, Nada Lavrač

BI-RS/14-15-068

The construction of corpora and lexica of nonstandard Serbian and Slovenian, 1.11.2014-31.12.2015, Tomaž Erjavec

J6-6842

JANES - Resources, Tools and Methods for the Research of Nonstandard Internet Slovene, 1.7.2014-30.6.2017, Tomaž Erjavec, Darja Fišer

BI-HR/14-15-047

Constructing a Bilingual Lexicon of Closely Related Languages From Existing Language Resources, 1.5.2014-31.12.2015, Tomaž Erjavec

NRSS2

Slovenian Literature in Unknown Early Modern Manuscripts. Information-Technology Aided Analyses and Scholarly Editions, 1.8.2013-31.7.2016, Tomaž Erjavec

PARSEME

Parsing and multi-word expressions, 8.3.2013-7.3.2017, Tomaž Erjavec

CLARIN.SI

Research Infrastructure for language resources and tools, 1.1.2013-31.12.2020, Tomaž Erjavec

J6-4019

The leading humanists in the Slovenian territory between the 16th and mid-19th centuries and their social and cultural environment, 1.7.2011-30.6.2014, Tomaž Erjavec

ESF NetWordS

The European Network on Word Structure, 1.5.2011-1.6.2015, Tomaž Erjavec

MUMIA

Multilingual and Multifaceted Interactive Information Access, 28.7.2010-29.11.2014, Igor Mozetič, Tomaž Erjavec

IMPACT

Improving Access to Text, 1.4.2010-30.6.2012, Tomaž Erjavec

J6-2009

Slovene translation studies - resources and research, 1.5.2009-30.4.2012, Tomaž Erjavec

P2-0103

Knowledge technologies, 1.1.2009-31.12.2014, Nada Lavrač

BI-FR/09-10-PROTEUS-015

Definition of syntactic-semantic structure of Slovene verb, 1.1.2009-31.12.2010, Tomaž Erjavec

FlaReNet

Fostering Language Resources Network, 1.9.2008-31.12.2011, Tomaž Erjavec

BI-JP/08-10/006

Japanese-Slovene resources for students of Japanese, 1.4.2008-31.3.2010, Tomaž Erjavec

XML_FED

Preparation of specifications for the use of VoiceXML technologies in XML_Filler application connected to the Jaws screen reader, 1.3.2008-15.10.2008, Tomaž Erjavec

L6-0163

Unknown 17th and 18th century manuscripts of Slovenian literature: information-technology aided register, scholarly editions and analyses, 1.2.2008-15.3.2012, Tomaž Erjavec

SEE-ERA.NET

Building Language Resources and Translation Models for Machine Translation focused on South Slavic and Balkan Languages, 1.10.2007-30.6.2008, Tomaž Erjavec

AhLib-Web

Web service for Corpus of XIX century translated books, 1.5.2007-1.5.2009, Tomaž Erjavec

V2-0380

Digital text centre with multimedia communication, 1.4.2007-31.3.2009, Tomaž Erjavec

J2-9180

Linguistic annotation of Slovene language: methods and resources, 1.1.2007-31.12.2009, Tomaž Erjavec

M2-0132

Multilingual mobile speech communicator for 21st century warriors, 1.1.2006-30.11.2008, Tomaž Erjavec

V6-0121

1.9.2004-31.8.2006, Tomaž Erjavec

M2-0019

Multilingual mobile speech communicator for 21st century warriors, 15.8.2004-14.8.2006, Tomaž Erjavec

L6-6373

Digital Critical Editions of Slovene Literature, 1.7.2004-30.6.2007, Tomaž Erjavec

V2-0894

Setting up resources and systems for simultaneous Slovene-English translation, 1.1.2004-31.12.2005, Tomaž Erjavec

P2-0103

Knowledge technologies, 1.1.2004-31.12.2008, Nada Lavrač

Development of linguistic resources for machine translation between Slovene and Serbian

Scientific and technological cooperation between the Republic of Slovenia and Serbia and Montenegro, 1.1.2004-31.12.2005, Tomaž Erjavec