IMPERATRIX

Improving Reproducibility of Experiments and Reusability of Research Outputs in Complex Data Analysis

Project duration

1. 7. 2018 - 30. 6. 2021

Contractor

Financed by

The Basic Research Project was selected for financing from the Public call for the (co-)financing of research projects in 2018 by the Slovenian Research Agency.

Abstract

Advances in science rest on the premise of trusted discovery: the research must be performed correctly and be reproducible by other scientists. In order to increase the reusability of research outputs, such as developed models and produced data, they should be Findable, Accessible, Interoperable and Reusable (the FAIR principles). The main point of FAIR is to ensure that research outputs are reusable and will actually be used by others, thus becoming more valuable. The DG for Research and Innovation of the EC has adopted the reusability of research data as one of its priorities, which led to the rapid endorsement of the FAIR principles by different stakeholders. Research outputs that are to fulfill the FAIR principles must be represented in a widely accepted machine-readable framework. Currently, a popular solution to data sharing that fulfills the FAIR requirements is the use of semantic web technologies.

Complex data analysis methods, originating from machine learning (ML) and data mining (DM), are increasingly being used in applications from various domains of science (e.g., life sciences, space research, etc.). In order to provide reproducibility of experiments (e.g., executions of methods) and reuse of research outputs (e.g., predictive models), one needs to formally describe the entities involved in the process of analysis, and store them together with their descriptions (e.g., metadata) as digital objects in a database-like structure. Having “semantically aware” stores of entities for complex data analytics, enhanced with automatic reasoning capabilities, would be beneficial for improving the reproducibility of experiments and the reuse of research outputs. In this way, we would move closer towards a FAIR data analysis process.

The main objective of the proposed project is to improve the repeatability of experiments and reusability of research outputs in complex data analysis. We will address this objective by combining approaches and ideas from the areas of complex data analytics, ontologies for science, semantic web and inductive databases. More specifically, we will develop a modular system for executing complex data analysis experiments, and semantically annotating, storing, querying and reusing their outputs. To meet the project's main objective, we plan to: (1) design, implement and populate ontologies for complex data analysis to be used for semantic annotation; (2) design and implement a prototype system for storing semantically annotated data, experiments and models; (3) develop querying strategies and test the querying capabilities of the prototype system; and (4) test the developed system in different use-case scenarios from various domains, such as machine learning, life sciences, space research and chemoinformatics.

The proposed research will significantly advance the state-of-the-art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models produced by the analytics methods. This is of particular importance for the application domains that heavily use data analytics tools in their work. The proposed project will also have a large impact in the context of automating data science. The experiments will be repeatable, since they will be performed in a sound, documented fashion on an architecture built for this kind of analysis. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck. Finally, in a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.

Project Team

doc. dr. Panče Panov - project leader

Postdoctoral researcher at the Jožef Stefan Institute

Panče Panov is a postdoctoral researcher at the Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia. He completed his PhD in 2012 in the area of data mining at the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. His thesis concerned the design and implementation of a modular ontology for the domain of data mining. His research interests are related to machine learning, data mining, the knowledge discovery process, and applying ontologies in these domains. His contributions include the development of ontologies for describing the domain of data mining and the process of knowledge discovery, which can be employed in various applications. He was actively involved in several EU-funded projects in the past (IQ, SUMO) and is currently involved in the MAESTRA project. In addition, he participated in several projects financed by the Slovenian Research Agency and one bilateral project between Slovenia and Croatia. He is co-editor of the book entitled “Inductive databases and constraint-based data mining” published in 2010 by Springer. In 2014, he was program co-chair of the International Conference on Discovery Science and co-editor of the conference proceedings published by Springer. Finally, in 2015 he was a co-editor of a special issue of the journal Machine Learning on Discovery Science.

dr. Dragi Kocev

Research associate at the Jožef Stefan Institute

Dr. Dragi Kocev is a research fellow at the Jožef Stefan Institute. His major research interest is the development of methods for predicting structured outputs that achieve state-of-the-art predictive performance across various application areas (e.g., gene function prediction, drug repurposing, energy consumption, vegetation and habitat modeling). He has an extensive bibliography that includes articles in top journals in the domain of machine learning as well as articles in the application domains. He was also a co-coordinator of the FP7 FET project MAESTRA, as well as a member of the winning team of the Mars Express data analysis challenge organized by the European Space Agency (ESA). His expertise will be essential in all tasks in WP 4 as well as in WP 1.

dr. Nikola Simidjievski

Postdoctoral researcher at the Jožef Stefan Institute

Dr. Nikola Simidjievski is a postdoctoral researcher at the Jožef Stefan Institute. His major research interest is the development of methods for automated modelling of complex dynamic systems and their application to various domains, such as systems ecology, systems biology, systems medicine and systems neuroscience. He also has extensive knowledge of the use and operation of HPC systems. He was a key member of the winning team of the Mars Express data analysis challenge organized by the European Space Agency (ESA). His expertise will be essential in the use cases in WP 4.

doc. dr. Petra Kralj Novak

Postdoctoral researcher at the Jožef Stefan Institute

Dr. Petra Kralj Novak is a postdoctoral researcher at the Jožef Stefan Institute. Her background is in computer science and knowledge discovery from databases. Her current research interests are in analyses of social and mainstream media, focusing on mediated stance and sentiment, misinformation and reputation manipulation. Her research is published in major machine learning and interdisciplinary journals and conferences. She is skilled in SQL and NoSQL database solutions, including graph databases (Neo4j) and document databases (Elasticsearch, MongoDB). She contributed to many national and European research projects, including Dolfins, Simpol, and Multiplex. Her expertise will be essential in WP 2.

prof. dr. Sašo Džeroski

Scientific councillor at the Jožef Stefan Institute

Prof. Dr. Sašo Džeroski is a scientific councillor at the Jožef Stefan Institute and the CIPKeBiP centre of excellence, both in Ljubljana, Slovenia. He is also a full professor at the Jožef Stefan International Postgraduate School. His research is mainly in the area of machine learning and data mining and their applications. He is co-author/co-editor of more than ten books/volumes. He has participated in many international research projects (mostly EU-funded) and coordinated several of them in the past (FP6 FET IQ, FP7 FET MAESTRA). Currently he is one of the principal investigators in the FET Flagship Human Brain Project and the Interreg project TRAIN, and he leads two national projects funded by the national research agency ARRS. His extensive research and management expertise will be valuable in a large number of tasks, especially in WP 1, 3 and 4.

Ana Kostovska

Master's student at the Jožef Stefan International Postgraduate School

Ilin Tolovski

Master's student at the Jožef Stefan International Postgraduate School

Project phases

The main goals of the proposed project are to improve the reproducibility of experiments and increase the reusability of research outputs in complex data analysis. In order to achieve these goals, we will combine approaches, ideas and state-of-the-art technologies from the areas of complex data analytics, ontologies for science, semantic web and inductive databases.

Phase One

In the first phase of the project, we will focus on the design and implementation of ontologies and the population of a knowledge base for complex data analysis. This will include a competency and requirements analysis, in which we will take into account the types of queries and the interactions with the system defined by the use scenarios. Furthermore, we will reuse and build on our previous work and follow ontology engineering best practices, in order to provide compatibility with other resources. The produced ontologies and the knowledge base will be the backbone of our system, as they will provide the means for semantic annotation, querying, and semantic inference, and will define the schema of the data, experiment and model stores. The major goal of the first phase will be to obtain a vocabulary for semantic annotation of data, experiments and models, based on the types of queries that we want our system to answer.

Phase Two

In the second phase of the project, we will design and implement an architecture for semantic annotation of data, experiments and models by using semantic web technologies, and for storing the semantically annotated entities in semantic stores. This will first involve identifying an adequate database system to use for the stores (e.g., relational or NoSQL). Furthermore, we will need to choose a formalism (e.g., PMML, PFA) for representing and storing the models produced by the complex data analytics systems. This is especially important for the execution, validation and revision of models by using new data. The major goal of the second phase of the project is to obtain a working prototype of the semantic stores, and for this purpose we will develop a set of testing scenarios for populating the semantic stores.
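
As a rough illustration of what storing a semantically annotated entity could look like, the following minimal sketch keeps annotations as subject-predicate-object triples in an in-memory store. The `TripleStore` class, the `annotate_experiment` helper, the `ex:` identifiers and the property names are all hypothetical placeholders rather than terms from the project's ontologies, and a real deployment would use an actual database system instead of a Python set.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t
            for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def annotate_experiment(
    store: TripleStore, experiment: str, dataset: str, algorithm: str, model: str
) -> None:
    """Describe one experiment run and the model it produced (assumed vocabulary)."""
    store.add(experiment, "rdf:type", "ex:Experiment")
    store.add(experiment, "ex:hasInputDataset", dataset)
    store.add(experiment, "ex:executesAlgorithm", algorithm)
    store.add(experiment, "ex:producesModel", model)
    store.add(model, "rdf:type", "ex:PredictiveModel")


store = TripleStore()
annotate_experiment(
    store, "ex:exp1", "ex:irisData", "ex:decisionTreeLearner", "ex:model1"
)

# Which models did experiment ex:exp1 produce?
models_of_exp1 = [o for (_, _, o) in store.match("ex:exp1", "ex:producesModel")]
```

Keeping the dataset, the algorithm and the resulting model linked through one experiment record is what later allows a stored model to be traced back to the data and method that produced it.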

Phase Three

In the third phase of the project, we will design, implement and test several querying strategies: querying the asserted knowledge, querying the inferred knowledge, federated querying and inductive querying. The first strategy will involve posing queries to the individual stores with the aim of querying only the asserted knowledge present in the stores at the time of the query. The second strategy will involve posing queries to the individual stores with the use of a semantic inference service. In this way, one can obtain results that take into account the inferred knowledge, by using axioms defined in the ontologies and the knowledge base. The third strategy will involve federated queries, which query different stores at the same time and combine the results. Federated querying has two main benefits: it scales easily and makes data management simpler. The fourth strategy will involve posing inductive queries. Here, a domain user would input their data analysis task, formulated in a declarative fashion, and the system would, if necessary, automatically generate and run experiments, query the stores and return the result. The advantage of this declarative specification of the analysis task is that the user does not need to specify which method to use. Finally, this phase will partially overlap with the scenarios for testing the semantic stores from phase two.
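
The difference between the first three strategies can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two stores are plain Python sets standing in for separate semantic stores, the class hierarchy `SUBCLASS_OF` and all `ex:` names are hypothetical, and the "inference" is limited to following subclass links, which is far simpler than a full reasoning service.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical ontology axioms: child class -> parent class.
SUBCLASS_OF: Dict[str, str] = {
    "ex:RandomForest": "ex:EnsembleModel",
    "ex:EnsembleModel": "ex:PredictiveModel",
    "ex:DecisionTree": "ex:PredictiveModel",
}

# Two separate stores, standing in for a model store and an experiment store.
model_store: Set[Triple] = {
    ("ex:m1", "rdf:type", "ex:RandomForest"),
    ("ex:m2", "rdf:type", "ex:DecisionTree"),
}
experiment_store: Set[Triple] = {
    ("ex:exp1", "ex:producesModel", "ex:m1"),
    ("ex:exp2", "ex:producesModel", "ex:m2"),
}


def superclasses(cls: str) -> Set[str]:
    """Return cls together with all of its ancestors in SUBCLASS_OF."""
    seen: Set[str] = set()
    current = cls
    while current is not None and current not in seen:
        seen.add(current)
        current = SUBCLASS_OF.get(current)
    return seen


def instances_asserted(store: Set[Triple], cls: str) -> Set[str]:
    """Strategy 1: only entities explicitly typed as cls."""
    return {s for (s, p, o) in store if p == "rdf:type" and o == cls}


def instances_inferred(store: Set[Triple], cls: str) -> Set[str]:
    """Strategy 2: also entities typed as any subclass of cls."""
    return {s for (s, p, o) in store if p == "rdf:type" and cls in superclasses(o)}


def experiments_producing(cls: str) -> Set[str]:
    """Strategy 3 (federated style): combine results from both stores."""
    models = instances_inferred(model_store, cls)
    return {
        s
        for (s, p, o) in experiment_store
        if p == "ex:producesModel" and o in models
    }


asserted = instances_asserted(model_store, "ex:PredictiveModel")
inferred = instances_inferred(model_store, "ex:PredictiveModel")
ensemble_experiments = experiments_producing("ex:EnsembleModel")
```

In this toy data no model is directly typed as `ex:PredictiveModel`, so the asserted query returns nothing, while the inferred query finds both models through the subclass axioms. The federated-style function then uses the model store to decide which models qualify and the experiment store to find the runs that produced them.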

Phase Four

Finally, in the fourth phase of the project, we will demonstrate different aspects of repeatable and reusable complex data analysis by using the prototype semantic stores, built in phase two, and the querying strategies, designed in phase three, in several use-case scenarios from the areas of machine learning, space research, life sciences and chemoinformatics. This phase will partially overlap with the second and the third phase, since the machine learning use case will be used for testing the semantic stores and the different querying strategies. We envision that the other domain use cases can be carried out in collaboration with other stakeholders, such as partners from other projects (e.g., H2020 Human Brain Project, Interreg TRAIN) or institutions with which we collaborate, such as the European Space Agency. For example, we can instantiate the proposed system for the task of biomarker discovery and discovering biological signatures of diseases. With this, we will be able to define scenarios for querying for analytics methods that satisfy a set of user constraints (e.g., find all methods that solve the task of biomarker discovery). Furthermore, we will define scenarios for querying the semantic stores, which will facilitate the search for models that satisfy a set of user constraints (e.g., find disease signatures involving some clinical score and some biomarker, find datasets that can be used to test the validity of a given biological signature of disease, etc.). Lastly, we can also look at scenarios for revising biological signatures in light of new data, regardless of whether they came from domain experts or were learned from data to begin with.
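
A query for methods that satisfy a set of user constraints could, in its simplest form, look like the sketch below. Everything in it is assumed for illustration: the `MethodAnnotation` record, its fields, the catalogue entries and the `find_methods` helper are hypothetical, and a real system would evaluate such constraints against the semantic stores rather than a Python list.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class MethodAnnotation:
    """A hypothetical, much simplified annotation of an analytics method."""

    name: str
    solves_task: str
    handles_missing_values: bool


# Illustrative catalogue; the entries are placeholders, not real methods.
CATALOGUE: List[MethodAnnotation] = [
    MethodAnnotation("ex:methodA", "biomarker discovery", True),
    MethodAnnotation("ex:methodB", "biomarker discovery", False),
    MethodAnnotation("ex:methodC", "time series forecasting", True),
]

# A constraint is any predicate over a method annotation.
Constraint = Callable[[MethodAnnotation], bool]


def find_methods(
    catalogue: Sequence[MethodAnnotation], constraints: Sequence[Constraint]
) -> List[str]:
    """Return the names of all methods that satisfy every constraint."""
    return [m.name for m in catalogue if all(c(m) for c in constraints)]


def solves_biomarker_discovery(m: MethodAnnotation) -> bool:
    return m.solves_task == "biomarker discovery"


def tolerates_missing_values(m: MethodAnnotation) -> bool:
    return m.handles_missing_values


biomarker_methods = find_methods(CATALOGUE, [solves_biomarker_discovery])
robust_biomarker_methods = find_methods(
    CATALOGUE, [solves_biomarker_discovery, tolerates_missing_values]
)
```

Expressing each constraint as an independent predicate keeps the query declarative: the user states what a suitable method must satisfy, and adding a further constraint only narrows the result.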

Workpackages

The work within the proposed project will be organized in five major work packages (WP).

WP 1 - Ontologies for complex data analytics

WP 1 is the central work package of this project. The goal of this WP is to build ontologies and a knowledge base for complex data analysis. The produced ontologies and the knowledge base will be the backbone of our system, as they will provide the means for semantic annotation, querying, and semantic inference, and will define the schema of the data, experiment and model stores. The planned work in this WP will be realized in four distinct tasks.

  • T.1.1 Competency and requirement analysis
  • T.1.2 Design and implementation of ontologies for complex data analysis
  • T.1.3 Knowledge base for complex data analysis
  • T.1.4 Semantic annotation of data, experiments and models

WP 2 - Data, experiment and model stores

The goal of this WP is to design and implement semantic stores for storing the semantically annotated entities (e.g., data, experiments, models) provided by task T.1.4 from WP 1. In addition, in this WP we will analyse different approaches for the representation, storage, execution and revision of models. The planned work in this WP will be realized in four distinct tasks.

  • T.2.1 Identification of adequate storage architecture
  • T.2.2 Strategies for representation, storing, executing and revision of models
  • T.2.3 Design and implementation of prototype data, experiment and model stores
  • T.2.4 Testing scenarios on prototype data, experiment and model stores

WP 3 - Queries on data, experiment and model stores

The goal of this WP is to develop and test different strategies for querying the endpoints of the prototype semantic stores implemented in WP 2, and to implement different semantic query services. The types of queries to be posed on the semantic stores will be based on the competency questions analysed in task T.1.1. The vocabulary for constructing the queries is provided by the ontologies and knowledge base for complex data analysis, built in WP 1. The query language used for querying the stores depends on the chosen database management system (e.g., if an Apache Jena TDB store is used for storing the RDF facts, Apache Jena provides the Fuseki SPARQL server, which allows querying the store using the SPARQL language). The querying strategies we will address in this WP will include querying the asserted knowledge from the individual stores, querying the inferred knowledge from the individual stores, federated querying and inductive querying. The planned work in this WP will be realized in four distinct tasks.

  • T.3.1 Querying asserted knowledge from the stores
  • T.3.2 Querying inferred knowledge from the stores
  • T.3.3 Federated semantic queries of data, experiment and model stores
  • T.3.4 Inductive queries on data, experiment and model stores
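
To make the Fuseki example concrete, the sketch below builds (but does not send) a SPARQL query request using only the Python standard library. It follows the SPARQL protocol convention of passing the query text in a `query` parameter of a GET request. The endpoint URL, the dataset name `experiments`, the `ex:` prefix and the `ex:producesModel` property are assumptions made for illustration; they are not part of the project's actual vocabulary or deployment.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request that carries a SPARQL query.

    The query text is URL-encoded into the `query` parameter, and the
    Accept header asks the server for SPARQL JSON results.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Assumed local Fuseki endpoint; "experiments" is a hypothetical dataset name.
ENDPOINT = "http://localhost:3030/experiments/sparql"

# Hypothetical vocabulary: list every model together with its experiment.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?experiment ?model WHERE {
  ?experiment ex:producesModel ?model .
}
"""

request = build_sparql_request(ENDPOINT, QUERY)

# With a server running, the request could be executed with
# urllib.request.urlopen(request) and the JSON body parsed with json.load.
```

Separating request construction from execution keeps the example runnable without a server and makes it straightforward to point the same query at a different store endpoint.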

WP 4 - Use cases

The goal of this WP is to test different components and aspects of the proposed architecture built in WP 1 - WP 3 on four use cases that use complex data analytics methods and originate from different domains: machine learning, space research, life sciences and chemoinformatics. The use cases will be carried out in the context of providing repeatable experiments and reusable research outputs, and will show the advantages of using the built architecture. The machine learning use case will start early in the project (M14) and will be used as a “test bed” in T.2.4 to test scenarios for populating the semantic stores, and in tasks T.3.1 - T.3.4 to test different strategies for querying the semantic stores. The other use cases will be realized once a stable prototype of the semantic stores and querying services is available. The planned work in this WP will be realized in four distinct tasks.

  • T.4.1 Use case in machine learning
  • T.4.2 Use case in space research
  • T.4.3 Use case in life sciences
  • T.4.4 Use case in chemoinformatics

WP 5 - Dissemination, exploitation and management

The goal of this WP is to provide dissemination and exploitation of the project results and developed resources originating from WP 1 - WP 4. This will be done via the project web page, by organizing workshops and hackathons, by presenting the project results at conferences and workshops, and by publishing scientific papers in top journals. Finally, this WP will provide information about the implementation and the results of the project to the Slovenian Research Agency. The planned work in this WP will be realized in four distinct tasks.

  • T.5.1 Project web page
  • T.5.2 Organization of hackathons and workshops
  • T.5.3 Management of research outputs
  • T.5.4 Project reporting and management

Project results and resources

Bibliography