Bisociative Knowledge Discovery for
Microarray Data Analysis
Igor Mozetiˇc1, Nada Lavraˇc1
,
2, Vid Podpeˇcan1, Petra Kralj Novak1,
Helena Motaln3, Marko Petek3, Kristina Gruden3,
Hannu Toivonen4, Kimmo Kulovesi4
1 Joˇzef Stefan Institute, Jamova 39, Ljubljana, Slovenia
{igor.mozetic, nada.lavrac, vid.podpecan, petra.kralj}@ijs.si
2 University of Nova Gorica, Vipavska 13, Nova Gorica, Slovenia
3 National Institute of Biology, Veˇcna pot 111, Ljubljana, Slovenia
{helena.motaln, marko.petek, kristina.gruden}@nib.si
4 Department of Computer Science, University of Helsinki, Finland
{hannu.toivonen, kimmo.kulovesi}@cs.helsinki.fi
Abstract. The paper presents an approach to computational knowledge
discovery through the mechanism of bisociation. Bisociative reasoning
is at the heart of creative, accidental discovery (e.g., serendipity), and
is focused on finding unexpected links by crossing contexts. Contextualization
and linking between highly diverse and distributed data and
knowledge sources is therefore crucial for the implementation of bisociative
reasoning. In the paper we explore these ideas on the problem of
analysis of microarray data. We show how enriched gene sets are found
by using ontology information as background knowledge in semantic subgroup
discovery. These genes are then contextualized by the computation
of probabilistic links to diverse bioinformatics resources. Preliminary experiments
with microarray data illustrate the approach.
1 Introduction
Systems biology studies and models complex interactions in biological systems
with the goal of understanding the underlying mechanisms. Biologists collect
large quantities of data from wet lab experiments and high-throughput platforms.
Public biological databases, like Gene Ontology and Kyoto Encyclopedia
of Genes and Genomes, are sources of biological knowledge. Since the growing
amounts of available knowledge and data exceed human analytical capabilities,
technologies that help analyzing and extracting useful information from such
large amounts of data need to be developed and used.
The concept of association is at the heart of many of today’s ICT technologies
such as information retrieval and data mining (for example, association rule
learning is an established data mining technology, [1]). However, scientific discovery
requires creative thinking to connect seemingly unrelated information,
for example, by using metaphors or analogies between concepts from different
domains. These modes of thinking allow the mixing of conceptual categories and
190
contexts, which are normally separated. One of the functional basis for these
modes is the idea of bisociation, coined by Artur Koestler half a century ago [7]:
“The pattern . . . is the perceiving of a situation or idea, L, in two selfconsistent
but habitually incompatible frames of reference, M1 and M2.
The event L, in which the two intersect, is made to vibrate simultaneously
on two different wavelengths, as it were. While this unusual situation
lasts, L is not merely linked to one associative context but bisociated
with two.”
Koestler found bisociation to be the basis for human creativity in seemingly
diverse human endeavors, such as humor, science, and arts. The concept of bisociation
in science is illustrated in Figure 1.
Fig. 1. Koestler’s schema of bisociative discovery in science ([7], p. 107).
We are interested in creative discoveries in science, and in particular in computational
support for knowledge discovery from large and diverse sources of
data and knowledge. To this end, we participate in a European FP7 FET-Open
project BISON5 which investigates possible computational realizations of bisociative
reasoning. The project is based on the following, somewhat simplified,
assumptions:
– A bisociative information network (named BisoNet) can be created from
available resources. BisoNet is a large graph, where nodes are concepts and
edges are probabilistic relations. Unlike semantic nets or ontologies, the
graph is easy to construct automatically since it carries little semantics.
To a large extent it encodes just circumstantial evidence that concepts are
somehow related through edges with some probability.
5 http://www.bisonet.eu/
191
– Different subgraphs can be assigned to different contexts (frames of reference).
– Graph analysis algorithms can be used to compute links between distant
nodes and subgraphs in a BisoNet.
– A bisociative link is a link between nodes (or subgraphs) from different
contexts.
In this paper we thus explore one specific pattern of bisociation: long-range links
between nodes (or subgraph) which belong to different contexts. More precisely,
we say that two concepts are bisociated if:
– there is no direct, obvious evidence linking them,
– one has to cross contexts to find the link, and
– this new link provides some novel insight into the problem domain.
We have to emphasize that context crossing is subjective, since the user has to
move from his ‘normal’ context (frame of reference) to an habitually incompatible
context to find the bisociative link [2]. In terms of Koestler (Figure 1), a habitual
frame of reference (plane M1) corresponds to a BisoNet subgraph as defined
by a user or his profile. The rest of the BisoNet represents different, habitually
incompatible contexts (in general, there may be several planes M2). The creative
act here is to find links (m2) which lead ‘out-of-the-plane’ via intermediate,
bridging concepts (L). Thus, contextualization and link discovery are two of the
fundamental mechanisms in bisociative reasoning as implemented in BISON.
Finding links between seemingly unrelated concepts from texts was already
addressed by Swanson [10]. The Swanson’s approach implements closed discovery,
the so-called A-B-C process, where A and C are given and one searches
for intermediate B concepts. On the other hand, in open discovery [16], only A
is given. One approach to open discovery, RaJoLink [8], is based on the idea
to find C via B terms which are rare (and therefore potentially interesting) in
conjunction with A. Rarity might therefore be one of the criteria to select links
which lead out of the habitual context (around A) to known, but non-obviously
related concepts C via B.
In this paper we present an approach to bisociative discovery and contextualization
of genes which should help in the analysis of microarray data. The
approach is based on semantic subgroup discovery (by using ontologies as background
knowledge in microarray data analysis), and the linking of various publicly
available bioinformatics databases. This is an ongoing work, where some elements
of bisociative reasoning are already implemented: creation of the BisoNet
graph, identification of relevant nodes in a BisoNet, and computation of links
to indirectly related concepts. Currently, we are expanding the BisoNet with
textual resources from PubMed, and implementing open discovery from texts
through BisoNet graph mining. We envision that the open discovery process will
identify potentially interesting concepts from different contexts which will act as
the target nodes for the link discovery algorithms. Links discovered in this way,
crossing contexts, might provide instances of bisociative discoveries.
192
The currently implemented steps of bisociative reasoning are the following.
The semantic subgroup discovery step is implemented by the SEGS system [14].
SEGS uses as background knowledge data from three publicly available, semantically
annotated biological data repositories, GO, KEGG and NCBI. Based on
the background knowledge, it automatically formulates biological hypotheses:
rules which define groups of differentially expressed genes. Finally, it estimates
the relevance (or significance) of the automatically formulated hypotheses on
experimental microarray data. The link discovery step is implemented by the
Biomine system [9]. Biomine weakly integrates a large number of biomedical resources,
and computes most probable links between elements of diverse sources.
It thus complements the semantic subgroup discovery technology, due to the
explanatory potential of additional link discovery and Biomine graph visualization.
While this link discovery process is already implemented, our current
work is devoted to the contextualization of Biomine nodes for bisociative link
discovery.
The paper is structured as follows. Section 2 gives an overview of five steps
in exploratory analysis of gene expression data. Section 3 describes an approach
to the analysis of microarray data, using semantic subgroup discovery in the
context of gene set enrichment. A novel approach, a first attempt at bisociative
discovery through contextualization, composed of using SEGS and Biomine
(SEGS+Biomine, for short) is in Section 4. An ongoing experimental case study
is presented in Section 5. We conclude in Section 6 with plans for future work.
2 Exploratory gene analytics
This section describes the steps which support bisociative discovery, targeted
at the analysis of differentially expressed gene sets: gene ranking, the SEGS
method for enriched gene set construction, linking of the discovered gene set to
related biomedical databases, and finally visualization in Biomine. The schematic
overview is in Figure 2.
The proposed method consists of the following five steps:
1. Ranking of genes. In the first step, class-labeled microarray data is processed
and analyzed, resulting in a list of genes, ranked according to differential
expression.
2. Ontology information fusion. A unified database, consisting of GO6 (biological
processes, functions and components), KEGG7 (biological pathways)
and NCBI8 (gene-gene interactions) terms and relationships is constructed
by a set of scripts, enabling easy updating of the integrated database (details
can be found in [12]).
3. Discovering groups of differentially expressed genes. The ranked list
of genes is used as input to the SEGS algorithm [14], an upgrade of the
6 http://www.geneontology.org/
7 http://www.genome.jp/kegg/
8 ftp://ftp.ncbi.nlm.nih.gov/gene/GeneRIF/interaction sources
193
. . .
gene2:
gene3:
gene1: + +
++
geneN: − −
SEGS Biomine
Microarray data Enriched gene sets Contextualized genes.
Fig. 2. Microarray gene analytics proceeds by first finding candidate enriched gene sets,
expressed as intersections of GO, KEGG and NCBI gene-gene interaction sets. Selected
enriched genes are then put in the context of different bioinformatic resources, as
computed by the Biomine link discovery engine. The ’+’ and ’-’ signs under Microarray
data indicate over- and under-expression values of genes, respectively.
RSD relational subgroup discovery algorithm [3, 4, 13], specially adapted to
microarray data analysis. The result is a list of most relevant gene groups that
semantically explain differential gene expression in terms of gene functions,
components, processes, and pathways as annotated in biological ontologies.
4. Finding links between gene group elements. The elements of the discovered
gene groups (GO and KEGG terms or individual genes) are used to
formulate queries for the Biomine link discovery engine. Biomine then computes
most probable links between these elements and entities from a number
of public biological databases. These links help the experts to uncover unexpected
relations and biological mechanisms potentially characteristic for the
underlying biological system.
5. Gene group visualization. Finally, in order to help in explaining the discovered
out-of-the-context links, the discovered gene relations are visualized
using the Biomine visualization tools.
3 SEGS: Search for Enriched Gene Sets
The goal of the gene set enrichment analysis is to find gene sets which form
coherent groups and are distinguished from the rest of the genes. More precisely,
a gene set is enriched if the member genes are statistically significantly
differentially expressed as compared to the rest of the genes. Two methods for
testing the enrichment of gene sets were developed: Gene Set Enrichment Analysis
(GSEA) [11] and Parametric Analysis of Gene Set Enrichment (PAGE) [6].
Originally, these methods take individual terms from GO and KEGG (which
annotate gene sets), and test whether the genes that are annotated by a specific
term are statistically significantly differentially expressed in the given microarray
dataset.
194
Fisher
GSEA
PAGE
Microarray
data
Generation
of gene sets
Enriched
gene sets
GO
KEGG
NCBI
Ranking
of genes
Fig. 3. Schematic representation of the SEGS method.
The novelty of the SEGS method, developed by Trajkovski et al. [12,14] and
used in this study, is that the method does not only test existing gene sets for
differential expression but it also generates new gene sets that represent novel
biological hypotheses. In short, in addition to testing the enrichment of individual
GO and KEGG terms, this method tests the enrichment of newly defined gene
sets constructed by the intersection of GO terms, KEGG terms and gene sets
defined by taking into account also the gene-gene interaction data from NCBI.
The SEGS method has four main components:
– the background knowledge (the GO, KEGG and NCBI databases),
– the SEGS hypothesis language (the GO, KEGG and interaction terms, and
their conjunctions),
– the SEGS hypothesis generation procedure (generated hypotheses in the
SEGS language correspond to gene sets), and
– the hypothesis evaluation procedure (the Fisher, GSEA and PAGE tests).
The schematic workflow of the SEGS method is shown in Figure 3.
4 SEGS+Biomine: Contextualization of genes
We made an attempt at exploiting bisociative discoveries within the biomedical
domain by explicit contextualization of enriched gene sets. We applied two
methods that use publicly available background knowledge for supporting the
work of biologists: the SEGS method for searching for enriched gene sets [14]
and the Biomine method for contextualization by finding links between genes
and other biomedical databases [9]. We combined the two methods in a novel
way: we used SEGS for hypothesis generation in the form of interesting gene
sets, and then formulated queries for Biomine for out-of-the-context link discovery
and visualization (see Figure 4). We believe that by forming hypotheses
with SEGS, constructed as intersections of terms from different ontologies (different
contexts), discovering links between them by Biomine, and visualizing the
SEGS hypotheses and the discovered links by the Biomine graph visualization
engine, the interpretation of the biological mechanisms underlying differential
gene expression is easier for biologists.
195
Biomine
databases
Enriched
gene sets
Biomine Biomine graph
link discovery visualization
Fig. 4. SEGS+Biomine workflow.
In the Biomine9 project [9], data from several publicly available databases
were merged into a large graph, a BisoNet, and a method for link discovery
between entities in queries was developed. In the Biomine framework nodes correspond
to entities and concepts (e.g., genes, proteins, GO terms), and edges
represent known, probabilistic relationships between nodes. A link (a relation
between two entities) is manifested as a path or a subgraph connecting the corresponding
nodes.
Vertex Type Source Database Nodes Degree
Article PubMed 330,970 6.92
Biological process GO 10,744 6.76
Cellular component GO 1,807 16.21
Molecular function GO 7,922 7.28
Conserved domain ENTREZ Domains 15,727 99.82
Structural property ENTREZ Structure 26,425 3.33
Gene Entrez Gene 395,611 6.09
Gene cluster UniGene 362,155 2.36
Homology group HomoloGene 35,478 14.68
OMIM entry OMIM 15,253 34.35
Protein Entrez Protein 741,856 5.36
Total 1,968,951
Table 1. Databases included in the Biomine snapshot used in the experiments.
The Biomine graph data model consists of various biological entities and
annotated relations between them. Large, annotated biological data sets can
be readily acquired from several public databases and imported into the graph
model in a relatively straightforward manner. Some of the databases used in
Biomine are summarized in Table 1. The snapshot of Biomine we use consists
of a total of 1,968,951 nodes and 7,008,607 edges. This particular collection of
data sets is not meant to be complete, but it certainly is sufficiently large and
versatile for real link discovery.
9 http://biomine.cs.helsinki.fi/
196
5 A case study
In the systems biology domain, our goal is to computationally help the experts
to find a creative interpretation of wet lab experiment results. In the particular
experiment, the task was to analyze microarray data in order to distinguish
between fast and slowly growing cell lines through differential expression of gene
sets, responsible for cell growth.
Enriched Gene Sets
1. SLOW-vs-FAST   GO Proc(’DNA metabolic process’) &
INTERACT( GO Comp(’cyclin-dep. protein kinase holoenzyme complex’))
2. SLOW-vs-FAST   GO Proc(’DNA replication’) &
GO Comp(’nucleus’) &
INTERACT( KEGG Path(’Cell cycle’))
3. SLOW-vs-FAST   . . .
Table 2. Top SEGS rules found in the cell growth experiment. The second rule states
that one possible distinction between the slow and fast growing cells is in genes participating
in the process of DNA replication which are located in the cell nucleus and
which interact with genes that participate in the cell cycle pathway.
Table 2 gives the top rules resulting from the SEGS search for enriched gene
sets. For each rule, there is a corresponding set of over expressed genes from
the experimental data. Figure 5 shows a part of the Biomine graph which links
a selected subset of enriched gene set to the rest of the nodes in the Biomine
graph.
The wet lab scientists have assessed that SEGS in combination with Biomine
provide additional hints on what to focus on when comparing the expression data
of cells. Additionally, such an in-silico analysis can considerably lower the costs
of in-vitro experiments with which the researchers in the wet lab are trying to
get a hint of a novel process or phenomena observed. This may be especially true
for situations when just knowing the final outcome one cannot explain the drug
effect, organ function, or disease satisfactorily. Namely, the gross, yet important
characteristics of the cells (organ function) are hidden (do not affect visual morphology)
or could not be recognized soon enough. An initial predisposition for
this approach is wide accessibility and low costs of high throughput microarray
analysis which generate appropriate data for in-silico analysis.
6 Conclusions
We presented SEGS+Biomine, a bisociation discovery system for exploratory
gene analytics. It is based on the non-trivial steps of subgroup discovery (SEGS)
and link discovery (Biomine). The goal of SEGS+Biomine is to enhance the
197
Fig. 5. Biomine subgraph related to five genes from the enriched gene set produced by
SEGS. Note that the gene and protein names are not explicitly presented, due to the
preliminary nature of these results.
creation of novel biological hypotheses about sets of genes. A prototype version
of the gene analytics software, which enhances SEGS and creates links to Biomine
queries and graphs, is available as a web application at
http://zulu.ijs.si/web/segs ga/.
In the future work we plan to enhance the contextualization of genes with contexts
discovered by biomedical literature mining. We will add PubMed articles
data into the BisoNet graph structure. To this end, we already have a preliminary
implementation of software, called Texas [5], which creates a probabilistic
network (BisoNet, compatible to Biomine) from textual sources. By focusing on
different types of links between terms (e.g., frequent and rare co-ocurances) we
expect to get hints at some unexpected relations between concepts from different
contexts.
Our long term goal is to help biologists better understand inter-contextual
links between genes and their role in explaining (at least qualitatively) underlying
mechanisms which regulate gene expressions. The proposed approach is
considered a first step at computational realization of bisociative reasoning for
creative knowledge discovery in systems biology.
7 Acknowledgements
The work presented in this paper was supported by the European Commission
under the 7th Framework Programme FP7-ICT-2007-C FET-Open project
198
BISON-211898, by the Slovenian Research Agency grants Knowledge Technologies
and Systems Biology J4-2228, and by the Algorithmic Data Analysis (Algodan)
Centre of Excellence of the Academy of Finland. We thank Igor Trajkovski
for his previous work on SEGS, and Filip ˇZelezn´y and Jakub Tolar for their
earlier contributions leading to SEGS.
<references_biblio/>
References
1. R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large
databases. Proc. 20th Intl. Conf. on Very Large Data Bases, VLDB, pp. 487–499,
Santiago, Chile, September 1994.
2. W. Dubitzky. Personal communication, FP7 BISON project review, Leuven, June
2009.
3. D. Gamberger and N. Lavraˇc. Expert-guided subgroup discovery: Methodology
and application. Journal of Artificial Intelligence Research 17:501–527, 2002.
4. D. Gamberger, N. Lavraˇc, F. ˇZelezn´y, J. Tolar. Induction of comprehensible models
for gene expression datasets by the subgroup discovery methodology. Journal of
Biomedical Informatics 37:269–284, 2004.
5. M. Jurˇsiˇc, N. Lavraˇc, I. Mozetiˇc, V. Podpeˇcan, H. Toivonen. Constructing information
networks from text documents. ECML/PKDD 2009 Workshop ”Explorative
Analytics of Information Networks”, Bled, 2009.
6. S.Y. Kim, D.J. Volsky. PAGE: Parametric Analysis of Gene Set Enrichment. BMC
Bioinformatics 6:144, 2005.
7. A. Koestler. The Act of Creation, The Macmillan Co., New York, 1964.
8. I. Petriˇc, T. Urbanˇciˇcc, B. Cestnik, M. Macedoni-Lukˇsiˇc. Literature mining
method RaJoLink for uncovering relations between biomedical concepts. Journal
of Biomedical Informatics, 42(2): 219–227, 2009.
9. P. Sevon, L. Eronen, P. Hintsanen, K. Kulovesi, H. Toivonen. Link discovery in
graphs derived from biological databases. Proc. of 3rd International Workshop on
Data Integration in the Life Sciences, (DILS’06), July 2006. Springer.
10. D.R. Swanson, N.R. Smalheiser, V.I. Torvik. Ranking indirect connections in
literature-based discovery: The role of Medical Subject Headings (MeSH). JASIST
57(11): 1427-1439, 2006.
11. P. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A.
Gillette. Gene set enrichment analysis: A knowledge based approach for interpreting
genome-wide expression profiles. Proc. of the National Academy of Science,
USA, 102(43): 15545–15550, 2005.
12. I. Trajkovski. Functional interpretation of gene expression data. Ph.D. Thesis, Jozef
Stefan International Postgraduate School, December 2007.
13. I. Trajkovski, F. ˇZelezny, N. Lavraˇc, J. Tolar. Learning relational descriptions of
differentially expressed gene groups. IEEE Transactions of Systems, Man and Cybernetics
C, special issue on Intelligent Computation for Bioinformatics, 38(1):
16–25, 2008a.
14. I. Trajkovski, N. Lavraˇc, J. Tolar. SEGS: Search for enriched gene sets in microarray
data. Journal of Biomedical Informatics, 41(4): 588–601, 2008b.
15. F. ˇZelezny, N. Lavraˇc. Propositionalization-based relational subgroup discovery
with RSD. Machine Learning, 62(1–2): 33–63, 2007.
16. M. Weeber, H. Klein, L.T.W. de Jong-van den Berg, R. Vos. Using concepts in
literature-based discovery: Simulating Swanson’s Raynaud-fish oil and migrainemagnesium
discoveries. J. Am. Soc. Inf. Sci. Tech., 52(7): 548–557, 2001.