J6-7094 | Department of Knowledge Technologies

No. of contract:

J6-7094

Type of project:

Basic ARIS Projects | National Projects

Duration:

from 01.01.2016 to 31.12.2018

Contact:

Tomaž Erjavec

Areas:

Language Technologies and Digital Humanities

The development and use of Slovene academic language at universities and in research is one of the central questions of the Slovene language policy. The problem is highlighted in the National Programme for Language Policy of the Republic of Slovenia 2014–2018 and a number of European studies also draw attention to the impact that the knowledge and development of academic discourse have on language vitality. It is therefore of fundamental importance to develop contemporary reference language resources that will help empower Slovene academic language and to undertake comprehensive research based on a representative sample of such language.

Slovene universities have established institutional repositories of scientific publications, containing various types of texts from PhD theses to scientific and professional papers. An important milestone is the establishment of the National Portal for Open Science, launched in 2013, which aggregates access to the digital libraries of individual universities and other institutions. The portal offers access to over 123,000 Slovene language publications from a wide range of disciplines. These publications are a highly valuable but so far completely unused source of data on Slovene academic writing, including terminological data.

The goal of the project was to overcome these limitations in several ways. First, the project compiled a corpus of Slovene academic writing containing PhD, MSc/MA and BSc/BA theses harvested from the Open Science portal. The texts were extracted from their source PDF format, which involved developing methods for text clean-up and structure extraction, and up-conversion to a uniform and standardised TEI representation. The corpus was linguistically annotated, with new tools and resources developed to improve the quality of the annotations.

The corpus served as the basis for studies in terminology extraction. The extracted term candidates will be exported to a public online dictionary viewer and editor, so that Slovene scientific communities from a range of subject fields will be able to engage in the management of their terminologies. An important aspect of the work undertaken in the project was the first empirically based study of Slovene academic discourse, founded on the compiled corpus. Data usability studies and in-depth interviews were also conducted in an attempt to determine the process of, and obstacles to, academic writing in Slovene.

The project made its results as widely available as possible: the produced language resources and tools are freely and openly available to the wider research community, which also advances the state of the art in corpus linguistics, digital humanities, and language technologies for Slovene. The complete corpus, as well as its three subcorpora, is available for analysis via the CLARIN.SI concordancers and for download from the CLARIN.SI repository:

Corpus KAS (complete corpus): http://hdl.handle.net/11356/1244
Corpus KAS-dr (PhD theses): http://hdl.handle.net/11356/1265
Corpus KAS-mag (MSc/MA theses): http://hdl.handle.net/11356/1266
Corpus KAS-dipl (BSc/BA theses): http://hdl.handle.net/11356/1267

The project was conducted by ten researchers from four academic institutions with distinct but complementary expertise to attain its goals: to strengthen Slovene academic language; to make Slovene better equipped for functioning in the information society; and to promote open dissemination of scientific results.

The project is presented in the publication:

ERJAVEC, Tomaž, FIŠER, Darja, LJUBEŠIĆ, Nikola. The KAS corpus of Slovenian academic writing. Language Resources & Evaluation, 2020. https://doi.org/10.1007/s10579-020-09506-4.

The paper is freely available for reading at https://rdcu.be/b7GrB.
