Cross-domain Literature Mining:
Finding Bridging Concepts with CrossBee
Matjaž Juršič1,2, Bojan Cestnik3,1
Tanja Urbančič4,1, Nada Lavrač1,4
1 Jožef Stefan Institute, Ljubljana, Slovenia 2 International Postgraduate School Jožef Stefan, Ljubljana, Slovenia 3 Temida d.o.o., Ljubljana, Slovenia 4 University of Nova Gorica, Nova Gorica, Slovenia
{matjaz.jursic, bojan.cestnik, tanja.urbancic, nada.lavrac}@ijs.si
Abstract
In literature-based creative knowledge discovery one of
the challenging tasks is to identify interesting bridging
terms or concepts which relate different domains. To
find these bridging concepts, our cross-domain literature
mining approach assumes that one first has to identify
two seemingly unrelated domains of interest. Bridging
terms, found in the intersection of these domains,
are then ranked according to their potential to uncover
useful, previously unexplored links between the two
domains. Term ranking, based on voting of an ensemble
of heuristics, is the main functionality of the CrossBee
(Cross-Context Bisociation Explorer) system presented
in this paper. The utility of the proposed approach is
showcased by exploring scientific papers on migraine
and magnesium, which is a reference domain in literature
mining.
Introduction
This paper¹ investigates the creative knowledge discovery
process which has its grounds in Mednick’s associative
creativity theory (Mednick 1962) and Koestler’s domain-crossing
associations, called bisociations (Koestler 1964).
Mednick defines creative thinking as the capacity of generating
new combinations of distinct associative elements
(concepts). He explains how thinking about the concepts
that are not strictly related to the elements under investigation
inspires unforeseen useful connections between these
elements. On the other hand, according to Koestler, a bisociation
is a result of creative processes of the mind when
making completely new associations between concepts
from domains that are usually considered separate. Consequently,
discovering bisociations may considerably improve
creative discovery processes. According to Koestler,
through the history of science, this mechanism has been a
crucial element of progressive insights and paradigm shifts.

1 This work was supported by the European Commission under
the 7th Framework Programme FP7-ICT-2007-C FET-Open
project BISON-211898, and Slovenian Research Agency grant
Knowledge Technologies (P2 0103).

The approach to creative knowledge discovery from text
documents presented in this paper is based on identifying
and exploring terms which have the potential to relate
different domains of interest, i.e., two distinct domain
literatures. While in general literature refers to any document
corpus (articles, novels, stories, etc.), our approach to
cross-domain literature mining focuses on the task of mining
scientific papers in the so-called closed discovery²
setting (Weeber et al., 2001) where two domains of interest,
A and C, are identified by the expert prior to starting
the knowledge discovery process, and the goal is to find
interesting bridging terms that relate the two literatures.
Weeber et al. (2001) have followed the work of literature-based
knowledge discovery in medical domains by
Swanson (1986) who designed the so-called ABC model
approach to investigate whether the phenomenon of interest
C in the first domain is related to some phenomenon A
in the other literature through some interconnecting phenomenon
B addressed in both literatures. If the literature
about C relates C with B, and the literature about A relates
A with the same B, then combining these relations may
suggest a relation between C and A. If closer inspection
confirms that an uncovered relation between C and A is
new, meaningful and interesting, this can be viewed as new
evidence or considered as a new piece of knowledge.
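
As a minimal illustration of the closed discovery idea, the following Python sketch treats each literature as a list of titles and takes the candidate set B to be the terms that occur in both. The function names and the toy titles are illustrative assumptions, not part of any system described here, and a real system would work on preprocessed terms rather than raw lowercase words.

```python
def terms(titles):
    """Collect the set of lowercase words occurring in a list of titles."""
    vocabulary = set()
    for title in titles:
        vocabulary.update(title.lower().split())
    return vocabulary


def candidate_bridging_terms(literature_a, literature_c):
    """Return the terms common to both literatures (the candidate set B)."""
    return terms(literature_a) & terms(literature_c)


# Toy titles, loosely modelled on the kind of titles used in this field.
literature_c = ["migraine treatment with calcium channel blockers"]
literature_a = ["magnesium as a physiologic calcium blocker"]

print(sorted(candidate_bridging_terms(literature_a, literature_c)))
```

Every term in the intersection is only a candidate: whether it suggests a new and meaningful relation between C and A still has to be checked by inspecting the documents.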
Smalheiser and Swanson (1998) developed an online
system ARROWSMITH which takes as input two sets of
titles from disjoint domains A and C and lists terms that are
common to literatures A and C; the resulting bridging
terms (b-terms, forming set B) are further investigated for
their potential to generate new scientific hypotheses (see an
example in Figure 1). Investigation of pairs of documents
might seem rather straightforward, as in the example
documents titled “Migraine treatment with calcium channel
blockers” (Anderson et al., 1986) and “Magnesium: nature’s
physiologic calcium blocker” (Iseri and French,
1984). However, it should be left to domain experts to
check whether the bridging term calcium channel blocker
suggests a valid, new and interesting relation (in this case,
the relation that migraine could be treated with magnesium).
To this end, it is helpful not just to identify a set of
candidate bridging terms B between literatures A and C,
but also to provide the expert with easy access to the
documents to be checked and to support this laborious
process by ranking the bridging term candidates, so that
the exploration starts with the most promising terms.

2 In contrast with closed discovery, open discovery leads the
creative knowledge discovery process from a given starting domain
towards a yet unknown second domain which at the end of
this process turns out to be connected with the first one.

International Conference on Computational Creativity 2012 33
Figure 1. Gold standard cross-domain literature mining example:
migraine (domain C) on the left, magnesium (domain A) on the
right, and in between a selection of bridging terms B as identified
by Swanson et al. (2006).
The approach presented in this paper is closely related to
bridging terms identification in the RaJoLink system (Urbančič
et al. 2007, Petrič et al. 2009). RaJoLink can be
used to identify interesting scientific articles in the PubMed
database, to compute different statistics, and to analyze
the articles with the aim to discover new knowledge.
The RaJoLink method involves three principal steps, Ra,
Jo and Link, which have been named after the key elements
of each step: Rare terms, Joint terms and Linking
terms, respectively. In the Ra step, interesting rare terms in
literature about phenomenon C under investigation are
identified. In the Jo step, all available articles about the
selected rare terms are inspected and interesting joint terms
that appear in the intersection of the literatures about rare
terms are identified as the candidates for A. This results in
a candidate hypothesis that C is connected with A. To
provide explanation for hypotheses generated in the Jo
step, in the final Link step the method searches for b-terms,
linking literatures A and C. Note that steps Ra and Jo implement
the open discovery, while step Link corresponds
to the closed discovery process of searching for b-terms
when A and C are already known (as illustrated in Figure 1).
Focusing on the closed discovery process, the method
proposed in this paper aims at finding bridging terms in
documents of two given domains A and C, enabling the
exploration of potentially interesting bisociative links between
the given domains with the aid of an ensemble of
new heuristics for bridging term discovery. Term ranking,
based on voting of an ensemble of heuristics, is the main
functionality of the new CrossBee (Cross-Context Bisociation
Explorer) system presented in this paper. To verify
the utility of the proposed approach, CrossBee was tested
on the problem of rediscovering links between migraine
and magnesium literatures, first explored by Swanson
(1986) and later by numerous other authors, including
Weeber et al. (2001) and Urbančič et al. (2009).
This paper is organized as follows. Section 2 presents
and relates two creative knowledge discovery frameworks:
Koestler’s bisociative link discovery (Koestler 1964) and
Swanson’s ABC model of closed discovery in literature
mining (Swanson 1986). It also relates our work to Boden’s
definition of creativity (Boden 1992) and Wiggins’s
computational creativity definition (Wiggins 2006). Section
3 presents the heuristics used for selecting the most
promising bridging concepts (bridging terms or b-terms) in
the intersection of two different sets of documents (two
domains of interest), evaluated on the migraine-magnesium
domain pair, explored originally in Swanson’s research. It
also presents an ensemble heuristic composed of six selected
elementary heuristics. Section 4 presents the functionality
and implementation of our system CrossBee for cross-context
bridging term discovery. We conclude with a discussion
and directions for further work.
Koestler’s Bisociations, Cross-domain Literature
Mining and Computational Creativity
Let us present some background on the mechanism of
bisociative reasoning which is at the heart of creative,
accidental discovery, referred to as serendipity by Roberts
(1989). Bisociative discovery, studied in this work, is focused
on finding unexpected terms/concepts linking different
domains.
Scientific discovery requires creative thinking to connect
seemingly unrelated information, for example, by using
metaphors or analogies between concepts from different
domains. These modes of thinking allow the mixing of
conceptual categories or domains, which are normally
separated. One of the functional bases for these modes is
the idea of bisociation, coined by Arthur Koestler (1964):
“The pattern . . . is the perceiving of a situation or
idea, L, in two self-consistent but habitually incompatible
frames of reference, M1 and M2. The event L, in
which the two intersect, is made to vibrate simultaneously
on two different wavelengths, as it were. While
this unusual situation lasts, L is not merely linked to
one associative context but bisociated with two.”
[Figure 1 content: literature about migraine (domain C); bridging terms B: 5 hydroxytryptamine, prostaglandin, serotonin, calcium channel blocker, ...; literature about magnesium (domain A)]
Koestler investigated bisociation as the basis for human
creativity in seemingly diverse human endeavors, such as
humor, science, and arts.
In this paper we explore a specific pattern of bisociation
(Berthold, 2012): terms, appearing in documents, which
represent bisociative links between concepts of different
domains, where the creative act is to find links which lead
‘out-of-the-plane’ in Koestler’s terms, i.e., links which
cross two or more different domains. Following
Berthold (2012), we claim that two concepts are bisociated
if (a) there is no direct, obvious evidence linking them, (b)
one has to cross domains to find the link, and (c) this new
link provides some novel insight into the problem domain.
We explore an approach to bisociative cross-domain link
discovery, based on the identification and ranking of terms
which have the potential of acting as previously unexplored
links between different predefined domains of expertise.
It can be seen that, in terms of Swanson’s ABC
model used in literature mining, this is an approach to
closed knowledge discovery, where two domains of interest,
A and C, are identified by the expert in advance. In
terms of Koestler’s model, the two domains, A and C,
correspond to the two habitually incompatible frames of
reference, M1 and M2. Moreover, the linking terms (called
bridging terms or b-terms in this paper) that are common to
literature A and C, explored by Smalheiser and Swanson
(1998), clearly correspond to Koestler’s notion of a situation
or idea, L, which is not merely linked to one associative
domain but bisociated with two domains M1 and M2.
Since our work originates from Koestler’s creative process
definition, it naturally satisfies his notion of creativity.
However, the concepts of creativity and computational
creativity have several other definitions. We argue that our
approach can be labeled as creative according to at least
two other definitions, introduced by Boden (1992) and
Wiggins (2006).
Boden (1992) defines creativity as “the ability to come
up with ideas or artefacts that are new, surprising and valuable.”
Considering this definition, and given that the main
output of our methodology is a ranked list of potentially
interesting bridging terms/concepts, we argue that–
although we do not produce new concepts–the ranking of
potentially interesting bridging concepts itself may represent
new, surprising and valuable ideas or artefacts. The
proposed approach produces new term rankings, because–
to the best of our knowledge–there are no similar methodologies
available. The results are often also surprising, both
because of their unlikeliness (as not commonly used terms
may appear at the top of the ranked list) and their effect in
subjective surprise (as noted by observing the expert using
our system). The weakest claim concerns the value of the
system, since the developed approach has not yet produced
any scientific breakthroughs; however,
we have already observed that it triggered novel insights by the
expert who tested the early versions of our system. Therefore,
we conclude that, using Boden’s definition, the level
of our system’s creativity is limited by the value of its results,
and only the reduced exploration time and the number
of users will show how valuable the system and its
results really are.
Considering computational creativity, Wiggins (2006)
proposes the following definition, which he states is
commonly adopted by the AI community: computational
creativity refers to “performance of tasks (by a computer)
which, if performed by a human, would be deemed creative.”
We argue that, although the ranking problem we
solve is not something people usually do, our system can
be considered creative according to this definition. Take an
analogy with online search engines whose task is finding
documents and ranking the search results. We believe that,
if such rankings were performed by a human, this could be
considered as a very creative process. The final results of
our methodology–the insights which might arise from
using our system–could also be considered scientifically
creative, where the ultimate creative act will be performed
by the experts using the system and not the system alone.
We designed the methodology in a way to enable the expert
to be more productive when generating such creative
ideas. Therefore, we argue that this added effectiveness of
the expert’s creativity process originates from the system
and its underlying methodology. Hence we believe our
system possesses some elements of computational creativity
proposed by Wiggins.
Bridging Term Detection Methodology
Creative thinking requires focusing on problems from new
perspectives. In this paper we follow Koestler (1964), who
investigated bisociation as the basis for human creativity,
with a goal of developing a computational system with the
ability to bridge different domains. Such relations between
distinct domains can be revealed through bridging concepts
(bridging terms, referred to as b-terms in this paper). Since
this may lead to the generation of many possible ideas, the
innovative generation of hypotheses as well as the support
for facilitated exploration of alternatives are needed for
creative cross-domain knowledge discovery.
Based on this assumption, we have developed and experimented
with different heuristics for finding bridging
terms in the context of closed knowledge discovery from
two different domains of expertise. The intuition behind
this research is that appropriate heuristics
for term evaluation and ranking will enable the user to
inspect only the top-ranked terms while retaining a
high probability of finding observations that may lead to the
discovery of new bridges between the literatures of different
domains.
In summary, our research aim is to find cross-domain
links by exploring the bridging terms in the intersection of
two literatures that establish previously unknown links
between literature A and literature C. In more detail, our
method of b-term discovery is performed as follows.
1. Perform text preprocessing to encode input texts into
the standard bag-of-words (BoW) representation. As
in standard text preprocessing for text mining, this is
performed through a number of steps:
a. text tokenization (where a continuous character
sequence is split into meaningful
sub-tokens, i.e., individual words or terms),
b. stop-word removal (removing predefined
words from a language that usually carry no
relevant information, e.g., and, or, a, an, the, ...),
c. stemming or lemmatization (the process that
converts each word/token into its morphologically
neutral form),
d. n-gram construction (n-grams are terms defined
as a concatenation of 1 to n words
which appear consecutively in the text),
e. bag-of-words (BoW) representation, i.e., a
vector representation of a document, with
value 1 (or word frequency-based weight) for
words/terms appearing in the document, and
value 0 for the rest of the corpus vocabulary.
2. Calculate the values of heuristics which favor b-terms
over other terms.
3. Sort the intersecting terms according to the values of
the best performing heuristics and present the top-ranked
terms (hopefully representing the b-terms) to
the expert during interactive exploration of the two
domains.
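
The preprocessing steps (a) to (e) listed above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the stop-word list is a tiny illustrative sample, `crude_stem` is a hypothetical placeholder that merely strips a plural "s" and stands in for a real stemmer or lemmatizer, and all function names are our own.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"and", "or", "a", "an", "the", "of", "in", "with"}


def tokenize(text):
    """Step a: split a character sequence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Step b: drop tokens that usually carry no relevant information."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Step c: placeholder stemmer that only strips a plural 's' (assumption)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_ngrams(tokens, n):
    """Step d: concatenate 1 to n consecutive tokens into terms."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


def bag_of_words(text, n=2):
    """Step e: map a document to term -> frequency (absent terms count as 0)."""
    tokens = [crude_stem(t) for t in remove_stop_words(tokenize(text))]
    return Counter(build_ngrams(tokens, n))


bow = bag_of_words("Migraine treatment with calcium channel blockers")
print(bow["calcium channel"], bow["blocker"], bow["magnesium"])
```

Note that in this sketch n-grams are built after stop-word removal, so a bigram may span a removed word; whether that is desirable is a design choice the list above leaves open.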
The development of the best performing heuristics consisted
of two phases:
1. Training: we proposed over 40 elementary heuristics,
which vary from very simple term-frequency statistics
to very elaborate combined measures. We then evaluated
their quality on the migraine-magnesium gold
standard domain investigated already by Swanson et
al. (1988). Results of the evaluation were used to select
some of the best performing and most complementary
heuristics that were joined into a new ensemble
heuristic. The ensemble heuristic proposed in this
paper is generally more accurate and robust than any
of the elementary heuristics used in its construction.
2. Testing: we independently evaluated the ensemble
heuristic on a second dataset, autism-calcineurin documents
investigated by Petrič et al., to confirm
its domain independence and its potential for b-term
identification. Note that due to space restrictions, the
description of testing of the system on the autism-calcineurin
domain pair is out of the scope of this paper;
the interested reader can find more information
in (Juršič et al.).
Elementary Heuristics for b-term Detection
We have proposed over 40 elementary heuristics for b-term
evaluation (Juršič et al.), divided into four groups:
frequency based, tf-idf based³, similarity based, and outlier
based heuristics. Most of these heuristics work fundamentally
in a similar way: they manipulate the data present in
the BoW document vector format to derive the term bisociation
potential quality measure, named the bisociation
score. The only exceptions are the outlier based heuristics
which first detect outlier documents and then use the BoW
vector information.
Instead of providing the entire list of heuristics whose
performance we tested extensively, we only specify a subset
of these which we actually selected to construct the
ensemble heuristic. The selected heuristics are defined as
follows.
Term to document frequency ratio: is a frequency based
heuristic $\mathrm{freqRatio}(t) = \mathrm{countTerm}_{D_u}(t) / \mathrm{countDoc}_{D_u}(t)$,
defined as the ratio of the number of occurrences of term t
in document set $D_u$ (named term frequency in tf-idf related
text preprocessing contexts), and the number of documents
where term t appears in document set $D_u$ (named document
frequency in tf-idf related contexts).
Sum of term’s importance in both domains: is a heuristic
based on tf-idf metrics, $\mathrm{tfidfDomnSum}(t) = \mathrm{tfidf}_{D_1}(t) + \mathrm{tfidf}_{D_2}(t)$,
defined as a sum of tf-idf value of term t in the
centroid vector of document set $D_1$ plus term’s tf-idf value
in the centroid vector of document set $D_2$, where the centroid
vector is defined as the sum of all document vectors
and thus represents an average document of the given document
collection.
Sum of term frequencies in three outlier sets: is an outlier
based heuristic $\mathrm{outFreqSum}(t) = \mathrm{countTerm}_{D_{CS}}(t) + \mathrm{countTerm}_{D_{RF}}(t) + \mathrm{countTerm}_{D_{SVM}}(t)$,
which computes the
sum of term frequencies in three outlier sets, where the sets
of outliers were identified by three classifiers (Sluban et al.
2012): Centroid Similarity (CS) classifier, Random Forest
(RF) classifier, and Support Vector Machine (SVM) classifier.
Relative frequencies in outlier sets: focusses on outlier
sets: $\mathrm{outFreqRelCS}(t) = \mathrm{countTerm}_{D_{CS}}(t) / \mathrm{countTerm}_{D_u}(t)$.
Documents in the outliers set frequently embody new information
that is often hard to explain in the context of
existing knowledge. We concentrate on specific outliers–
domain outliers–i.e., documents that tend to be more similar
to the documents of the other domain than to those of
their own domain. The procedure that we use to detect
outlier documents first builds a classification model for
each domain and then classifies all the documents using
the trained classifier. The documents that are misclassified
are declared as outlier documents, since according to the
classification model they do not belong to their initial domain.
The other two outlier based heuristics, relative frequency
in the RF outlier set (outFreqRelRF) and relative
frequency in the SVM outlier set (outFreqRelSVM), are
defined in the same way as the outFreqRelCS heuristic.

3 tf-idf stands for Term Frequency Inverse Document Frequency
word weight computation, used in text mining (Feldman and
Sanger, 2007).
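
The domain outlier detection procedure described above (build a classification model for each domain, classify all documents, and declare the misclassified ones outliers) can be illustrated with a simple centroid similarity classifier. The sketch below is an assumption-laden illustration, not the implementation of Sluban et al. (2012): it uses raw term counts, cosine similarity, and classifies the same documents that were used to build the centroids; the function names, domain labels and toy documents are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Sum the BoW vectors of one domain into a single centroid vector."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return total


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def domain_outliers(docs):
    """Return ids of documents that are closer to the other domain's centroid.

    docs maps a document id to a (domain label, BoW vector) pair.
    """
    by_domain = {}
    for label, vector in docs.values():
        by_domain.setdefault(label, []).append(vector)
    centroids = {label: centroid(vs) for label, vs in by_domain.items()}

    outliers = []
    for doc_id, (label, vector) in docs.items():
        predicted = max(centroids, key=lambda d: cosine(vector, centroids[d]))
        if predicted != label:  # misclassified, hence a domain outlier
            outliers.append(doc_id)
    return outliers


# Toy documents, invented for illustration only.
docs = {
    "m1": ("MIG", Counter({"migraine": 2, "headache": 1})),
    "m2": ("MIG", Counter({"migraine": 1, "serotonin": 1})),
    "m3": ("MIG", Counter({"magnesium": 1, "calcium": 1})),
    "g1": ("MAG", Counter({"magnesium": 2, "calcium": 1})),
    "g2": ("MAG", Counter({"magnesium": 1, "deficiency": 1})),
}
print(domain_outliers(docs))
```

In the toy data, document m3 is labelled as a migraine document but shares its vocabulary with the magnesium documents, so it is the one reported as a domain outlier.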
We have also defined a supplementary baseline heuristic,
$\mathrm{random}(t) = \mathrm{randNum}()$, which serves as a baseline
for the others, as it returns a random number from the interval
(0,1) regardless of the term under investigation.
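
To make the frequency based and outlier based definitions concrete, here is a minimal Python sketch that evaluates them on documents stored as term-to-count dictionaries. The function names and the toy data are our own assumptions; tfidfDomnSum is left out because it depends on the exact tf-idf weighting, which is not specified in this section. The functions assume the term occurs at least once in the full document set, and Python's `random.random()` returns values in [0, 1) rather than the open interval.

```python
import random
from collections import Counter


def count_term(term, docs):
    """Total number of occurrences of term in a list of BoW vectors."""
    return sum(doc.get(term, 0) for doc in docs)


def count_doc(term, docs):
    """Number of BoW vectors in which term occurs at least once."""
    return sum(1 for doc in docs if doc.get(term, 0) > 0)


def freq_ratio(term, all_docs):
    """freqRatio: term frequency divided by document frequency."""
    return count_term(term, all_docs) / count_doc(term, all_docs)


def out_freq_sum(term, outlier_sets):
    """outFreqSum: term frequencies summed over the given outlier sets."""
    return sum(count_term(term, docs) for docs in outlier_sets)


def out_freq_rel(term, outlier_docs, all_docs):
    """outFreqRel*: frequency in one outlier set relative to the whole set."""
    return count_term(term, outlier_docs) / count_term(term, all_docs)


def random_baseline(term, rng=random):
    """random: a number that does not depend on the term at all."""
    return rng.random()


# Toy data, invented for illustration only.
all_docs = [
    Counter({"calcium": 2, "migraine": 1}),
    Counter({"calcium": 1}),
    Counter({"magnesium": 3}),
]
outliers_cs = [all_docs[0]]
outliers_rf = [all_docs[0], all_docs[1]]
outliers_svm = []

print(freq_ratio("calcium", all_docs))
print(out_freq_sum("calcium", [outliers_cs, outliers_rf, outliers_svm]))
print(out_freq_rel("calcium", outliers_cs, all_docs))
```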
Evaluation of Elementary Heuristics
To test the proposed heuristics for b-term detection, we
have evaluated them on the problem of detecting bisociative
links between migraine and magnesium in the respective
literatures. To this end, we replicated Swanson’s early
migraine-magnesium experiment that represents a
gold standard for literature-based discovery. The evaluation
procedure used in this experiment differs from
Swanson’s original method and the RaJoLink method in
that a human expert was not involved.
Magnesium deficiency has been shown in several studies
to cause migraine headaches (e.g., Swanson 1990;
Thomas et al. 1992; Thomas et al. 2000; Demirkaya et al.
2001; Trauninger et al. 2002). In the literature-based closed
discovery process Swanson managed to find more than 60
pairs of articles connecting the migraine domain with the
magnesium deficiency via several bridging concepts. His
closer inspection of the literature about migraine and the
literature about magnesium showed that 11 pairs of documents,
when put together, provided confirmation of a hypothesis
that magnesium deficiency may cause migraine
headaches (Swanson 1990). Some of the detected bridging
terms are shown in Figure 1.
Similar to Swanson’s original study of the migraine literature
(Swanson 1988) we used titles as input to our
closed discovery process. We performed the experiments
on a subset of PubMed titles of articles that were published
before 1988 (i.e., before Swanson’s literature-based discovery
of the migraine-magnesium relation) and were
retrieved with the PubMed search engine. As a result we
got 2,425 migraine and 5,633 magnesium titles of PubMed
articles. These article titles were preprocessed with standard
text mining techniques, resulting in 13,525 distinct
terms which were analyzed and scored by the presented elementary
heuristics. Each heuristic assigned a score to every
term from the list. Afterwards we sorted all 7 lists (6 elementary
heuristics and the baseline heuristic) and thus
created 7 ranked lists of terms. Among these 13,525 terms,
there were also all 43 terms which Swanson (1988) marked
as b-terms and which we hoped to propagate to the top of
the ranked list using the designed heuristics methodology.
The b-terms identified by Swanson, verified by the expert
to provide new discoveries in the field, are used as a gold
standard in the evaluation in this work.
We compared the heuristics based on their ROC (Receiver
Operating Characteristic) curves and AUC (Area
Under ROC) analysis. The idea underlying ROC curve
construction is the following: go from the beginning of a
ranked list and every time a b-term is seen, draw a line up on
the ROC canvas, otherwise draw a line right. The ideal curve
(when all b-terms are at the very beginning of a ranked list)
would go straight up to the top, followed by a straight
section to the right edge of the graph. The area under the ideal
ROC curve is equal to 1 when both scales are normalized.
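
The curve construction just described translates directly into code. The following Python sketch, with function names of our own choosing, walks a ranked list, steps up for every b-term and right for every other term, and computes the normalized area under the resulting step curve; it assumes the list contains at least one b-term and at least one other term.

```python
def roc_points(ranked_terms, b_terms):
    """Walk a ranked list: step up for a b-term, step right otherwise."""
    x, y = 0, 0
    points = [(x, y)]
    for term in ranked_terms:
        if term in b_terms:
            y += 1
        else:
            x += 1
        points.append((x, y))
    return points


def auc(ranked_terms, b_terms):
    """Area under the step curve, normalized so the ideal ranking gives 1."""
    positives = sum(1 for t in ranked_terms if t in b_terms)
    negatives = len(ranked_terms) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one b-term and one other term")
    area, height = 0, 0
    for term in ranked_terms:
        if term in b_terms:
            height += 1
        else:
            area += height  # each step right adds a column of current height
    return area / (positives * negatives)


b_terms = {"b1", "b2"}
print(auc(["b1", "b2", "x", "y"], b_terms))  # ideal ranking
print(auc(["b1", "x", "b2", "y"], b_terms))  # one b-term ranked lower
```

With the two toy rankings, the ideal ordering yields an area of 1.0 and the second ordering yields 0.75, since one non-b-term is ranked ahead of one b-term.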
ROC analysis (see Figure 2) shows the performance of
elementary heuristics on the migraine-magnesium gold
standard dataset. Details on heuristics evaluation can be
found in (Juršič et al. 2012), while the main observations
and results are outlined below. It can be observed that
some heuristics are well suited to the purpose
of b-term discovery. We are especially satisfied with heuristics
which perform well at the start of the
ranked list, e.g., heuristic outFreqRelRF places four b-terms
among the first 50 terms in its ranked list,
while the random approach finds less than one b-term
among its first 200 terms. On the other hand, some heuristics
do not perform so well at the start of the list, e.g., outFreqSum
and tfidfDomnSum do not look promising at
first sight. However, we included them in the set of six
heuristics on the basis of complementarity, so that they fit
together well when used in the ensemble heuristic,
providing not only better performance but also greater
robustness of the ensemble.
Figure 2. ROC curves representing the performance of elementary
heuristics on the learning (migraine-magnesium) dataset.
Legend: freqRatio (93.35%), tfidfDomnSum (93.85%),
outFreqSum (94.96%), outFreqRelRF (95.24%),
outFreqRelSVM (95.06%), outFreqRelCS (94.96%), random (50%).
The Ensemble Heuristic
The ensemble heuristic is a heuristic which combines results
of the selected elementary heuristics (outFreqRelRF,
outFreqRelSVM, outFreqRelCS, outFreqSum, tfidfDomnSum,
and freqRatio) into an aggregated result. In principle,
the ensemble heuristic score is a sum of two parts: the
ensemble voting score and the ensemble position score, and
is computed as: $s_t = s_t^{\mathrm{vote}} + s_t^{\mathrm{pos}}$.
1. Ensemble voting score ($s_t^{\mathrm{vote}}$) of term t is based on the
number of times the term appears in the first third of
the elementary heuristics ranked lists. Each selected
base heuristic $h_i$ gives one vote ($s_{t_j,h_i}^{\mathrm{vote}} = 1$) to each
term which is in the first third in its ranked list of
terms and zero votes to all the other terms ($s_{t_j,h_i}^{\mathrm{vote}} = 0$).
Formally, the ensemble voting score of a term $t_j$ that is
at position $p_j$ in the ranked list of $n$ terms is computed
as a sum of individual heuristics’ voting scores:
$$ s_{t_j}^{\mathrm{vote}} = \sum_{i=1}^{k} s_{t_j,h_i}^{\mathrm{vote}} = \sum_{i=1}^{k} \begin{cases} 1: & p_j < n/3, \\ 0: & \text{otherwise.} \end{cases} $$
Therefore, each term can get a score $s_{t_j}^{\mathrm{vote}} \in \{0, 1, 2, \ldots, k\}$,
where $k$ is the number of base heuristics
used in the ensemble.
2. Ensemble position score ($s_t^{\mathrm{pos}}$) of term t is based on an
average position of the term in the elementary heuristics
ranked lists. For each heuristic $h_i$, the term’s position
score $s_{t_j,h_i}^{\mathrm{pos}}$ is calculated as $(n - p_j)/n$, which results
in position scores being in the interval [0,1). For
an ensemble of $k$ heuristics, the ensemble position
score is computed as an average of individual heuristics’
position scores:
$$ s_{t_j}^{\mathrm{pos}} = \frac{1}{k} \sum_{i=1}^{k} s_{t_j,h_i}^{\mathrm{pos}} = \frac{1}{k} \sum_{i=1}^{k} \frac{n - p_j}{n}. $$
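
The two parts of the ensemble score can be combined in a short Python sketch. It assumes that each base heuristic has already produced a ranked list of the same n terms, ordered from most to least promising, and that positions are counted from 1 (an assumption that keeps the position score inside [0,1)). The function names and the toy rankings are illustrative; the linear-time `index` lookup is kept for readability and would be replaced by a position dictionary for large vocabularies.

```python
def ensemble_scores(ranked_lists):
    """Combine k ranked lists of the same n terms into ensemble scores."""
    k = len(ranked_lists)
    n = len(ranked_lists[0])
    scores = {}
    for term in ranked_lists[0]:
        votes = 0
        position_sum = 0.0
        for ranked in ranked_lists:
            p = ranked.index(term) + 1      # 1-based position of the term
            if p < n / 3:                   # first third earns one vote
                votes += 1
            position_sum += (n - p) / n     # per-heuristic position score
        scores[term] = votes + position_sum / k
    return scores


def rank_by_ensemble(ranked_lists):
    """Return the terms sorted by decreasing ensemble score."""
    scores = ensemble_scores(ranked_lists)
    return sorted(scores, key=scores.get, reverse=True)


# Two toy rankings of six terms, invented for illustration only.
h1 = ["a", "b", "c", "d", "e", "f"]
h2 = ["a", "c", "b", "d", "f", "e"]
scores = ensemble_scores([h1, h2])
print(rank_by_ensemble([h1, h2])[0], round(scores["a"], 4))
```

Because the voting score is an integer and the position score is always below 1, votes dominate the ordering and the position score acts as a tie-breaker among terms with the same number of votes.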
Using the migraine-magnesium domain pair, we experimentally
confirmed–through the ROC curve evaluation of
different heuristics in terms of the quality of b-term retrieval–that
the ensemble heuristic is the best measure for
b-term detection and is able to retrieve b-terms approximately
7 times faster compared to the random approach.
Besides testing on the migraine-magnesium dataset, we
also evaluated the ensemble heuristic on an independent
autism-calcineurin dataset (Petrič et al.) and confirmed
the utility and domain independence of the proposed
approach.
The CrossBee System
This section presents our system which helps the experts in
searching for hidden links that connect two seemingly
unrelated domains. We designed and implemented an
online system named CrossBee (Cross-Context Bisociation
Explorer)⁴. The system was first designed as an online
implementation of the ensemble ranking methodology. To
the core functionality we have, however, added other functionalities
and content presentations which effectively
turned CrossBee into a user-friendly tool for ranking and
exploration of bisociative terms that have the potential for
cross-context link discovery. This enables the user not only
to spot but also to efficiently investigate terms that represent
potential cross-domain links.

4 CrossBee is available at website: http://crossbee.ijs.si/.

Below we describe a typical use case and the system’s
extended functionality.
A Typical CrossBee Use Case
The most standard use case is the following. The user starts
at the system’s home page by inputting two sets of documents
of interest and by tuning the parameters of the system.
The minimal required user input at this point is a file
with the documents from two domains. The prescribed
format of the input file is kept simple to enable all users,
regardless of their computing skills, to prepare the files.
Each line of the file contains exactly three tab-separated
entries: (a) document identification number, (b) domain
acronym, and (c) the document text. The other options
available to the user include specifying the exact preprocessing
options, specifying the base heuristics to be used in
the ensemble, specifying outlier documents identified by
an external outlier detection software, defining the already
known b-terms, and others. When the user selects all the
desired options he proceeds to the next step.
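
For readers preparing such a file, the three-column format can be checked with a few lines of Python before upload. This is a hypothetical helper written for illustration, not part of CrossBee; the function name, the domain acronyms MIG and MAG, and the sample titles are assumptions, and blank lines are tolerated only as a convenience.

```python
def parse_input_lines(lines):
    """Parse lines of: document id <TAB> domain acronym <TAB> document text."""
    documents = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines (an assumption)
        entries = line.split("\t")
        if len(entries) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated entries, "
                f"found {len(entries)}"
            )
        doc_id, domain, text = entries
        documents.append({"id": doc_id, "domain": domain, "text": text})
    return documents


# Sample lines, invented for illustration only.
sample = [
    "1\tMIG\tMigraine treatment with calcium channel blockers\n",
    "2\tMAG\tMagnesium and calcium metabolism\n",
]
for document in parse_input_lines(sample):
    print(document["id"], document["domain"], document["text"])
```

The same function accepts an open file object, since iterating over a text file yields its lines.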
CrossBee then starts a computationally very intensive
step in which it prepares all the data needed for the fast
subsequent exploration phase. During this step the actual
text preprocessing, base heuristics, ensemble, bisociation
scores and rankings are computed in the way presented in
the previous section. This step does not require any user
intervention.
After computation, the user is presented with a ranked
list of b-term candidates. The list provides the user with
some additional information including the ensemble’s
individual base heuristics votes and term’s domain occurrence
statistics in both domains. The user then browses
through the list and chooses the term(s) he believes to be
promising for finding meaningful connections between the
two domains.
At this point, the user can start inspecting the actual appearances
of the selected term in both domains, using the
efficient side-by-side document inspection as shown in
Figure 3. In this way, he can verify whether his rationale
behind selecting this term as a bridging term can be justified
based on the contents of the inspected documents.
The most important result of the exploration procedure
is evidence that a chosen term is an actual bridge between
the two domains, based on supporting facts from the documents.
As experienced in sessions with the experts, the
identified documents are an important result as well, as
they usually turn out to be a valuable source of information
providing a deeper insight into the discovered terms which
indicate new cross-domain relations.
Extended CrossBee Functionality
Below we list the implemented functionalities of the
CrossBee system.
International Conference on Computational Creativity 2012 38
• Document focused exploration empowers the user to
filter and order the documents by various criteria.
Some users may prefer to start exploring the domains
by reading documents rather than by browsing
through the term lists. The ensemble ranking can be
used to suggest which documents to read, namely
those with the highest proportion of highly ranked
terms.
• Detailed document view provides a more detailed
presentation of a single document, including various
term statistics and a similarity graph showing the similarity
between this document and other documents
from the dataset.
• Methodology performance analysis supports the evaluation
of the methodology by providing various data
which can be used to measure the quality of the results,
e.g., data for plotting the ROC curves.
• High-ranked term emphasis marks the terms according
to their bisociation score calculated by the ensemble
heuristic. When this feature is used, all high-ranked
terms are emphasized throughout the whole application,
making them easier to spot.
• b-term emphasis marks the terms defined as b-terms
by the user.
• Domain separation is a simple but effective option
which colors all the documents from the same domain
with the same color, making an obvious distinction between
the documents from the two domains.
• UI customization enables the user to decrease or increase
the intensity of the following features: high-ranked
term emphasis, b-term emphasis and domain
separation. In cooperation with the experts, we discovered
that some of them like the emphasizing features
while others do not. Therefore, we introduced
the UI customization, where everybody can set the intensity
of these features according to their preferences.
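The document suggestion used in document focused exploration can be illustrated as follows. This Python sketch is a hypothetical reading of the description above and not the CrossBee implementation: it assumes that documents are already tokenized into terms, that "highly ranked" means the first top_k terms of the ensemble ranking, and that the proportion is taken over the distinct terms of each document. The names rank_documents, doc_terms, term_ranking and top_k are introduced here for illustration only.

```python
def rank_documents(doc_terms, term_ranking, top_k):
    """Order documents by their share of highly ranked terms.

    doc_terms    -- dict: document id -> list of terms in that document
    term_ranking -- list of terms, best-ranked first (e.g. ensemble output)
    top_k        -- how many leading terms count as "highly ranked" (assumed)
    Returns a list of (document id, share) pairs, highest share first.
    """
    top_terms = set(term_ranking[:top_k])
    scored = []
    for doc_id, terms in doc_terms.items():
        distinct = set(terms)
        # Share of the document's distinct terms that are highly ranked;
        # documents without terms get a share of 0.
        share = len(distinct & top_terms) / len(distinct) if distinct else 0.0
        scored.append((doc_id, share))
    # Sort by descending share; break ties by document id for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Made-up toy data; the terms and ids are placeholders.
toy_docs = {
    "d1": ["a", "b", "c", "d"],
    "d2": ["a", "b"],
    "d3": [],
}
toy_ranking = ["a", "b", "x", "c"]
for doc_id, share in rank_documents(toy_docs, toy_ranking, top_k=2):
    print(doc_id, share)
```

Other definitions of the proportion, for instance counting repeated term occurrences or weighting terms by their ensemble score, would fit the description equally well; the version above is simply the shortest one to state.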
Figure 3. Side-by-side document inspection of potential cross-domain links in CrossBee, illustrated with an example
from the migraine-magnesium dataset analysis that focuses on the term paroxysmal.
Discussion and Further Work
Current literature-based approaches depend strictly on
simple, associative information search. Commonly, a literature-based
association is computed using measures of
similarity or co-occurrence. Because of their ‘hard-wired’
underlying criteria of co-occurrence or similarity, association-based
methods often fail to discover relevant information
which is not related in obvious associative ways.
Information related across separate domains is especially
hard to identify with conventional associative approaches.
In such cases the domain-crossing connections,
called bisociations (Berthold 2012), can help generate
creative and innovative discoveries.
There was previous research by Swanson (1986),
Weeber et al. (2001), Petrič et al. (2009) and several other
authors investigating the means for finding novel interesting
connections between disparate research findings which
can be extracted from the published literature. They have
shown that the analysis of implicit cross-context associations
hidden in scientific literature can guide hypotheses
formulation and lead to the discovery of new knowledge.
The methodology presented in this paper has the potential
for improved computational creativity in supporting the
expert in the task of cross-domain literature mining. The
main novelty is an approach to ensemble-based bridging
term ranking. The creative act of finding bridging terms is
supported by the user-friendly CrossBee system for literature
mining, implementing closed cross-domain link discovery.
It has the potential to identify bridging concepts in
the intersection of different domain literatures, as confirmed
in the experiments in mining the literature on migraine
and magnesium.
In further work we will apply the CrossBee system to
new domain pairs, focusing on the system’s potential to
lead to new scientific discoveries. In addition to linking to
PubMed, we will also explore ways to connect CrossBee
to other document sources, including keyword search
over documents on the web. Moreover, it
would be interesting to explore the potential of CrossBee
in media research, as well as in linguistics, where metaphors
could potentially be discovered by cross-context text mining.
One of the priorities of our work will be, however, to
use CrossBee in collaboration with the experts from different
fields (e.g., physicists and biologists) to address real-life
domain problems and to get valuable feedback from these
targeted users.
<references_biblio/>
References
Berthold, M., ed. 2012. Bisociative Knowledge Discovery.
Springer 2012 (in press).
Boden, M. 1992. The Creative Mind. London: Abacus.
Demirkaya, S.; Vural, O.; Dora, B.; and Topcuoglu, M.A.
2001. Efficacy of intravenous magnesium sulfate in the
treatment of acute migraine attacks. Headache 41(2): 171-
177.
Feldman, R. and Sanger, J. 2007. The Text Mining Handbook:
Advanced Approaches in Analyzing Unstructured
Data. Cambridge University Press.
-XUãLþ0; Sluban, B.; Cestnik, B.; Grþar, M.; and Lavraþ,
N. 2012. Bridging concept identification for constructing
information networks from text documents. In: Berthold,
M.R. ed., Bisociative Knowledge Discovery. Springer
LNAI 7250 (in press).
Koestler, A. 1964. The Act of Creation. New York: MacMillan.
Mednick, S.A. 1962. The associative basis of the creative
process. Psychol. Rev. 69: 220-232.
Petrič, I.; Urbančič, T.; Cestnik, B.; and Macedoni-Lukšič,
M. 2009. Literature mining method RaJoLink for uncovering
relations between biomedical concepts. J. Biomed.
Inform. 42(2): 219-227.
Roberts, R.M. 1989. Serendipity: Accidental Discoveries in
Science. Wiley.
Sluban, B.; Juršič, M.; Cestnik, B.; and Lavrač, N. 2012.
Exploring the power of outliers for cross-domain literature
mining. In: Berthold, M.R. ed., Bisociative Knowledge
Discovery. Springer LNAI 7250 (in press).
Smalheiser, N.R., and Swanson, D.R. 1998. Using
ARROWSMITH: a computer-assisted approach to formulating
and assessing scientific hypotheses. Comput. Methods
Programs Biomed. 57(3): 149-153.
Swanson, D.R. 1986. Undiscovered public knowledge.
Library Quarterly 56(2): 103-118.
Swanson, D.R. 1988. Migraine and magnesium: Eleven
neglected connections. Perspectives in Biology and Medicine
31(4): 526–557.
Swanson, D.R. 1990. Medical literature as a potential
source of new knowledge. Bull. Med. Libr. Assoc. 78(1):
29–37.
Swanson, D.R.; Smalheiser, N.R.; and Torvik, V.I. 2006.
Ranking indirect connections in literature-based discovery:
The role of Medical Subject Headings (MeSH). J. Am. Soc.
Inf. Sci. Tec. 57(11): 1427-1439.
Thomas, J.; Millot, J.M.; Sebille, S.; Delabroise, A.M.;
Thomas, E.; Manfait, M.; and Arnaud, M.J. 2000. Free and
total magnesium in lymphocytes of migraine patients -
effect of magnesium-rich mineral water intake. Clin. Chim.
Acta 295(1-2): 63-75.
Thomas, J.; Thomas, E.; and Tomb, E. 1992. Serum and
erythrocyte magnesium concentrations and migraine. Magnes.
Res. 5(2): 127-130.
Trauninger, A.; Pfund, Z.; Koszegi, T.; and Czopf, J. 2002.
Oral magnesium load test in patients with migraine. Headache
42(2): 114-119.
Urbančič, T.; Petrič, I.; Cestnik, B.; and Macedoni-Lukšič,
M. 2007. Literature mining: towards better understanding
of autism. In: Bellazzi, R.; Abu-Hanna, A.; and Hunter, J.,
eds., Proceedings of the 11th Conference on Artificial
Intelligence in Medicine in Europe, 217-226. Springer.
8UEDQþLþ73HWULþ,and Cestnik, B. 2009. RaJoLink: A
method for finding seeds of future discoveries in nowadays
literature. In: Rauch, J., ed., Foundations of Intelligent
Systems. LNAI, 5722. 129-138. Springer.
Weeber, M.; Vos, R.; Klein, H.; and de Jong-van den Berg,
L.T.W. 2001. Using concepts in literature-based discovery:
Simulating Swanson’s Raynaud–fish oil and migraine–
magnesium discoveries. J. Am. Soc. Inf. Sci. Tech. 52(7):
548-557.
Wiggins, G.A. 2006. A Preliminary Framework for Description,
Analysis and Comparison of Creative Systems.
Journal of Knowledge Based Systems 19(7): 449–458.