J2-9230

IMPERATRIX - Improving Reproducibility of Experiments and Reusability of Research Outputs in Complex Data Analysis
Izboljšanje ponovljivosti eksperimentov in večkratne uporabe raziskovalnih izsledkov pri analizi kompleksnih podatkov

No. of contract: J2-9230
Duration: from 01.07.2018 to 30.06.2022

Advances in science rest on the premise of trusted discovery: that the research has been performed correctly and can be reproduced by other scientists. To increase the reusability of research outputs, such as developed models and produced data, these outputs should be Findable, Accessible, Interoperable and Reusable (the FAIR principles). The main point of FAIR is to ensure that research outputs are reusable and will actually be used by others, thus becoming more valuable. The DG for Research and Innovation of the EC has adopted the reusability of research data as one of its priorities, which led to the rapid endorsement of the FAIR principles by different stakeholders. Research outputs that are to fulfill the FAIR principles must be represented in a widely accepted machine-readable framework. Currently, a popular solution for data sharing that fulfills the FAIR requirements is the use of semantic web technologies.

Complex data analysis methods, originating from machine learning (ML) and data mining (DM), are increasingly being used in applications from various domains of science (e.g., life sciences, space research, etc.). To provide reproducibility of experiments (e.g., executions of methods) and reuse of research outputs (e.g., predictive models), one needs to formally describe the entities involved in the process of analysis and store them, together with their descriptions (e.g., metadata), as digital objects in a database-like structure. Having “semantically aware” stores of entities for complex data analytics, enhanced with automatic reasoning capabilities, would be beneficial for improving reproducibility of experiments and reuse of research outputs. In this way we would move closer towards a FAIR data analysis process.

The main objective of the proposed project is to improve the repeatability of experiments and reusability of research outputs in complex data analysis. We will address this objective by combining approaches and ideas from the areas of complex data analytics, ontologies for science, semantic web and inductive databases. More specifically, we will develop a modular system for executing complex data analysis experiments and for semantically annotating, storing, querying and reusing their outputs. To meet the project's main objective, we plan to:

  1. design, implement and populate ontologies for complex data analysis to be used for semantic annotation;
  2. design and implement a prototype system for storing semantically annotated data, experiments and models;
  3. develop querying strategies and test the querying capabilities of the prototype system; and
  4. test the developed system in different use-case scenarios from various domains, such as machine learning, life sciences, space research and chemoinformatics.
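
To make the idea of annotating, storing and querying experiment outputs more concrete, the following is a minimal sketch in Python. It is not the project's system or ontology: the `AnnotationStore` class, the predicate names (`addressesTask`, `usedDataset`, `producedModel`) and all identifiers such as `exp:1` are hypothetical, chosen only for illustration. Annotations are kept as simple subject-predicate-object triples, in the spirit of semantic web technologies.

```python
from dataclasses import dataclass, field
from typing import Optional

Triple = tuple[str, str, str]


@dataclass
class AnnotationStore:
    """Toy in-memory store of (subject, predicate, object) annotations."""

    triples: set[Triple] = field(default_factory=set)

    def annotate(self, subject: str, predicate: str, obj: str) -> None:
        """Attach one semantic annotation to an entity."""
        self.triples.add((subject, predicate, obj))

    def query(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list[Triple]:
        """Return matching triples in sorted order; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t
            for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


def models_for_task(store: AnnotationStore, task: str) -> list[str]:
    """Find models produced by experiments that address the given task."""
    experiments = {
        s for s, _, _ in store.query(predicate="addressesTask", obj=task)
    }
    return sorted(
        o for s, _, o in store.query(predicate="producedModel") if s in experiments
    )


# Hypothetical annotations for two experiments (all names are illustrative).
store = AnnotationStore()
store.annotate("exp:1", "addressesTask", "task:classification")
store.annotate("exp:1", "usedDataset", "data:iris")
store.annotate("exp:1", "producedModel", "model:tree-1")
store.annotate("exp:2", "addressesTask", "task:regression")
store.annotate("exp:2", "producedModel", "model:linear-1")

print(models_for_task(store, "task:classification"))  # prints ['model:tree-1']
```

A real system would presumably replace the ad hoc predicate strings with terms from the ontologies of step 1 and the in-memory set with a persistent store, but the pattern of annotating entities and then querying across them stays the same.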

The proposed research will significantly advance the state of the art in the general area of computer science, in the specific area of machine learning and data mining, and particularly in the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models produced by analytics methods. This is of particular importance for application domains that rely heavily on data analytics tools in their work. The proposed project will also have a large impact in the context of automating data science. Experiments will be repeatable, since they will be performed in a sound, documented fashion, supported by an architecture designed for such analyses. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck. Finally, in a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.