JOS - Linguistic annotation of Slovene language: methods and resources
Jezikoslovno označevanje slovenskega jezika: metode in viri

from 01.01.2007 to 31.12.2009


The project will develop automatic inductive methods and tools for morphosyntactic, syntactic and semantic annotation, which will be used to build manually corrected and publicly accessible Slovene language resources, namely annotated corpora and lexicons. These results will provide the urgently needed infrastructure for further development of language technologies for Slovene. As the resources will be accessible not only to the project members but to any research team in Slovenia and abroad, they are expected to act as a catalyst for R&D in language technologies for Slovene, an area of vital importance for the effective use of Slovene in the Information Society.
The project comprises four work packages. The first, a horizontal work package, addresses the technical and legal aspects of resource accessibility, i.e. making the resources available to developers as training and test datasets, and to linguists for research on Slovene. The remaining three work packages correspond to three levels of linguistic analysis. The first level is morphosyntactic tagging and the related lemmatization, the basic level of annotation indispensable to virtually every language-oriented computer program; the project will improve on existing methods and produce an annotated corpus, manually checked for errors. The second level, automatic syntactic analysis, is of key importance for in-depth text analysis, since it reveals the interdependence of syntactic units; the project will produce a syntactically annotated corpus and a valency lexicon, both hand-corrected, and a syntactic parser for Slovene. The third level deals with the lexical semantics of Slovene, needed e.g. in machine translation and information retrieval; the project will upgrade the existing semantic lexicon (ontology) for Slovene, annotate a corpus using concepts from this lexicon, and develop methods for automatic ontology building and for the disambiguation of polysemous lexemes.
The project will draw on the partners' extensive experience in developing Slovene language resources and in machine learning. Its points of departure will be the morphosyntactically annotated reference corpus Fida PLUS, the syntactically annotated prototype corpus SDT and the prototype semantic lexicon sloWNet. Work in the project will be closely tied to concurrent Slovene and EU projects concerned with the development of machine learning methods for machine translation and ontology building.