Cross-domain Literature Mining: Finding Bridging Concepts with CrossBee

Matjaž Juršič¹,², Bojan Cestnik³,¹, Tanja Urbančič⁴,¹, Nada Lavrač¹,⁴
¹ Jožef Stefan Institute, Ljubljana, Slovenia
² International Postgraduate School Jožef Stefan, Ljubljana, Slovenia
³ Temida d.o.o., Ljubljana, Slovenia
⁴ University of Nova Gorica, Nova Gorica, Slovenia
{matjaz.jursic, bojan.cestnik, tanja.urbancic, nada.lavrac}@ijs.si

Abstract

In literature-based creative knowledge discovery one of the challenging tasks is to identify interesting bridging terms or concepts which relate different domains. To find these bridging concepts, our cross-domain literature mining approach assumes that one first has to identify two seemingly unrelated domains of interest. Bridging terms, found in the intersection of these domains, are then ranked according to their potential to uncover useful, previously unexplored links between the two domains. Term ranking, based on voting of an ensemble of heuristics, is the main functionality of the CrossBee (Cross-Context Bisociation Explorer) system presented in this paper. The utility of the proposed approach is showcased by exploring scientific papers on migraine and magnesium, which is a reference domain in literature mining.

Introduction

This paper¹ investigates the creative knowledge discovery process which has its grounds in Mednick’s associative creativity theory (Mednick 1962) and Koestler’s domain-crossing associations, called bisociations (Koestler 1964). Mednick defines creative thinking as the capacity of generating new combinations of distinct associative elements (concepts). He explains how thinking about the concepts that are not strictly related to the elements under investigation inspires unforeseen useful connections between these elements. On the other hand, according to Koestler, a bisociation is a result of creative processes of the mind when making completely new associations between concepts from domains that are usually considered separate.
Consequently, discovering bisociations may considerably improve creative discovery processes. According to Koestler, through the history of science, this mechanism has been a crucial element of progressive insights and paradigm shifts.

¹ This work was supported by the European Commission under the 7th Framework Programme FP7-ICT-2007-C FET-Open project BISON-211898, and Slovenian Research Agency grant Knowledge Technologies (P2 0103).

The approach to creative knowledge discovery from text documents presented in this paper is based on identifying and exploring terms which have the potential to relate different domains of interest, i.e., two distinct domain literatures. While in general literature refers to any document corpus (articles, novels, stories, etc.), our approach to cross-domain literature mining focuses on the task of mining scientific papers in the so-called closed discovery² setting (Weeber et al., 2001) where two domains of interest, A and C, are identified by the expert prior to starting the knowledge discovery process, and the goal is to find interesting bridging terms that relate the two literatures.

Weeber et al. (2001) have followed the work on literature-based knowledge discovery in medical domains by Swanson (1986), who designed the so-called ABC model approach to investigate whether the phenomenon of interest C in the first domain is related to some phenomenon A in the other literature through some interconnecting phenomenon B addressed in both literatures. If the literature about C relates C with B, and the literature about A relates A with the same B, then combining these relations may suggest a relation between C and A. If closer inspection confirms that an uncovered relation between C and A is new, meaningful and interesting, this can be viewed as new evidence or considered as a new piece of knowledge.
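In its simplest form, the ABC model amounts to looking for terms B that occur in both literatures. The following minimal Python sketch illustrates this under strong simplifying assumptions: documents are plain title strings, terms are whitespace-separated lowercase words, and the function name candidate_bridging_terms is hypothetical (it is not part of any system discussed here).

```python
def candidate_bridging_terms(literature_a, literature_c):
    """Return the terms that occur in both literatures (candidate set B)."""
    vocab_a = {word for title in literature_a for word in title.lower().split()}
    vocab_c = {word for title in literature_c for word in title.lower().split()}
    return vocab_a & vocab_c


# Toy input: one simplified title per domain (illustrative only).
titles_a = ["magnesium natures physiologic calcium blocker"]
titles_c = ["migraine treatment with calcium channel blockers"]

common = candidate_bridging_terms(titles_a, titles_c)
print(common)
```

Note that "blocker" and "blockers" do not match in this sketch, so only "calcium" is shared; this is one reason why a realistic pipeline needs stemming or lemmatization before intersecting vocabularies.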
Smalheiser and Swanson (1998) developed an online system, ARROWSMITH, which takes as input two sets of titles from disjoint domains A and C and lists terms that are common to literatures A and C; the resulting bridging terms (b-terms, forming set B) are further investigated for their potential to generate new scientific hypotheses (see an example in Figure 1). Investigation of pairs of documents might seem rather straightforward, as in the example documents titled “Migraine treatment with calcium channel blockers” (Anderson et al., 1986) and “Magnesium: nature’s physiologic calcium blocker” (Iseri and French, 1984). However, it should be left to domain experts to check whether the bridging term calcium channel blocker suggests a valid, new and interesting relation (in this case, the relation that migraine could be treated with magnesium). To this end, it is helpful not just to identify a set of candidate bridging terms B between literatures A and C, but also to provide the expert with easy access to the documents to be checked, and to support this laborious process by ranking the bridging term candidates so that the exploration starts with the most promising terms.

² In contrast with closed discovery, open discovery leads the creative knowledge discovery process from a given starting domain towards a yet unknown second domain which at the end of this process turns out to be connected with the first one.

International Conference on Computational Creativity 2012

Figure 1. Gold standard cross-domain literature mining example: migraine (domain C) on the left, magnesium (domain A) on the right, and in between a selection of bridging terms B as identified by Swanson et al. (2006).

The approach presented in this paper is closely related to bridging term identification in the RaJoLink system (Urbančič et al. 2007, Petrič et al. 2009).
RaJoLink can be used to identify interesting scientific articles in the PubMed database, to compute different statistics, and to analyze the articles with the aim of discovering new knowledge. The RaJoLink method involves three principal steps, Ra, Jo and Link, which have been named after the key elements of each step: Rare terms, Joint terms and Linking terms, respectively. In the Ra step, interesting rare terms in the literature about the phenomenon C under investigation are identified. In the Jo step, all available articles about the selected rare terms are inspected, and interesting joint terms that appear in the intersection of the literatures about rare terms are identified as the candidates for A. This results in a candidate hypothesis that C is connected with A. To provide an explanation for hypotheses generated in the Jo step, in the final Link step the method searches for b-terms linking literatures A and C. Note that steps Ra and Jo implement open discovery, while step Link corresponds to the closed discovery process of searching for b-terms when A and C are already known (as illustrated in Figure 1).

Focusing on the closed discovery process, the method proposed in this paper aims at finding bridging terms in documents of two given domains A and C, enabling the exploration of potentially interesting bisociative links between the given domains with the aid of an ensemble of new heuristics for bridging term discovery. Term ranking, based on voting of an ensemble of heuristics, is the main functionality of the new CrossBee (Cross-Context Bisociation Explorer) system presented in this paper. To verify the utility of the proposed approach, CrossBee was tested on the problem of rediscovering links between the migraine and magnesium literatures, first explored by Swanson (1986) and later by numerous other authors, including Weeber et al. (2001) and Urbančič et al. (2009).

This paper is organized as follows.
Section 2 presents and relates two creative knowledge discovery frameworks: Koestler’s bisociative link discovery (Koestler 1964) and Swanson’s ABC model of closed discovery in literature mining (Swanson 1986). It also relates our work to Boden’s definition of creativity (Boden 1992) and Wiggins’s computational creativity definition (Wiggins 2006). Section 3 presents the heuristics used for selecting the most promising bridging concepts (bridging terms or b-terms) in the intersection of two different sets of documents (two domains of interest), evaluated on the migraine-magnesium domain pair, explored originally in Swanson’s research. It also presents an ensemble heuristic composed of six selected elementary heuristics. Section 4 presents the functionality and implementation of our system CrossBee for cross-context bridging term discovery. We conclude with a discussion and directions for further work.

Koestler’s Bisociations, Cross-domain Literature Mining and Computational Creativity

Let us present some background on the mechanism of bisociative reasoning which is at the heart of creative, accidental discovery, referred to as serendipity by Roberts (1989). Bisociative discovery, studied in this work, is focused on finding unexpected terms/concepts linking different domains.

Scientific discovery requires creative thinking to connect seemingly unrelated information, for example, by using metaphors or analogies between concepts from different domains. These modes of thinking allow the mixing of conceptual categories or domains which are normally separated. One of the functional bases for these modes is the idea of bisociation, coined by Arthur Koestler (1964): “The pattern . . . is the perceiving of a situation or idea, L, in two self-consistent but habitually incompatible frames of reference, M1 and M2. The event L, in which the two intersect, is made to vibrate simultaneously on two different wavelengths, as it were.
While this unusual situation lasts, L is not merely linked to one associative context but bisociated with two.”

[Figure 1 diagram: literature about migraine (domain C); bridging terms B: 5 hydroxytryptamine, prostaglandin, serotonin, calcium channel blocker, . . .; literature about magnesium (domain A)]

Koestler investigated bisociation as the basis for human creativity in seemingly diverse human endeavors, such as humor, science, and arts. In this paper we explore a specific pattern of bisociation (Berthold, 2012): terms, appearing in documents, which represent bisociative links between concepts of different domains, where the creative act is to find links which lead ‘out-of-the-plane’ in Koestler’s terms, i.e., links which cross two or more different domains. Following Berthold (2012), we claim that two concepts are bisociated if (a) there is no direct, obvious evidence linking them, (b) one has to cross domains to find the link, and (c) this new link provides some novel insight into the problem domain.

We explore an approach to bisociative cross-domain link discovery, based on the identification and ranking of terms which have the potential of acting as previously unexplored links between different predefined domains of expertise. In terms of Swanson’s ABC model used in literature mining, this is an approach to closed knowledge discovery, where two domains of interest, A and C, are identified by the expert in advance. In terms of Koestler’s model, the two domains, A and C, correspond to the two habitually incompatible frames of reference, M1 and M2. Moreover, the linking terms (called bridging terms or b-terms in this paper) that are common to literatures A and C, explored by Smalheiser and Swanson (1998), clearly correspond to Koestler’s notion of a situation or idea, L, which is not merely linked to one associative domain but bisociated with two domains M1 and M2.
Since our work originates from Koestler’s creative process definition, it naturally satisfies his notion of creativity. However, the concepts of creativity and computational creativity have several other definitions. We argue that our approach can be labeled as creative according to at least two other definitions, introduced by Boden (1992) and Wiggins (2006).

Boden (1992) defines creativity as “the ability to come up with ideas or artefacts that are new, surprising and valuable.” Considering this definition, and given that the main output of our methodology is a ranked list of potentially interesting bridging terms/concepts, we argue that, although we do not produce new concepts, the ranking of potentially interesting bridging concepts itself may represent new, surprising and valuable ideas or artefacts. The proposed approach produces new term rankings because, to the best of our knowledge, there are no similar methodologies available. The results are often also surprising, both because of their unlikeliness (as not commonly used terms may appear at the top of the ranked list) and because of the subjective surprise they cause (as noted by observing the expert using our system). Our weakest claim concerns the value of the system, as the developed approach has not yet produced any scientific breakthroughs; however, we have already observed that it triggered novel insights by the expert who tested the early versions of our system. Therefore, we conclude that under Boden’s definition, the level of our system’s creativity is limited by the value of its results, and only the reduced exploration time and the number of users will show how valuable the system and its results really are.
Considering computational creativity, Wiggins (2006) proposes the following definition, which he states is commonly adopted by the AI community: computational creativity refers to “performance of tasks (by a computer) which, if performed by a human, would be deemed creative.” We argue that, although the ranking problem we solve is not something people usually do, our system can be considered creative according to this definition. Take an analogy with online search engines whose task is finding documents and ranking the search results. We believe that, if such rankings were performed by a human, this could be considered a very creative process. The final results of our methodology, i.e., the insights which might arise from using our system, could also be considered scientifically creative, where the ultimate creative act will be performed by the experts using the system and not by the system alone. We designed the methodology to enable the expert to be more productive when generating such creative ideas. Therefore, we argue that this added effectiveness of the expert’s creative process originates from the system and its underlying methodology. Hence we believe our system possesses some elements of computational creativity as proposed by Wiggins.

Bridging Term Detection Methodology

Creative thinking requires focusing on problems from new perspectives. In this paper we follow Koestler (1964), who investigated bisociation as the basis for human creativity, with the goal of developing a computational system with the ability to bridge different domains. Such relations between distinct domains can be revealed through bridging concepts (bridging terms, referred to as b-terms in this paper). Since this may lead to the generation of many possible ideas, the innovative generation of hypotheses as well as the support for facilitated exploration of alternatives are needed for creative cross-domain knowledge discovery.
Based on this assumption, we have developed and experimented with different heuristics for finding bridging terms in the context of closed knowledge discovery from two different domains of expertise. The intuition behind this research is that appropriate heuristics for term evaluation and ranking enable the user to inspect only the top-ranked terms, which should result in a high probability of finding observations that may lead to the discovery of new bridges between the literatures of different domains.

In summary, our research aim is to find cross-domain links by exploring the bridging terms in the intersection of two literatures that establish previously unknown links between literature A and literature C. In more detail, our method of b-term discovery is performed as follows.

1. Perform text preprocessing to encode input texts into the standard bag-of-words (BoW) representation. As in standard text preprocessing for text mining, this is performed through a number of steps:
   a. text tokenization (where a continuous character sequence is split into meaningful sub-tokens, i.e., individual words or terms),
   b. stop-word removal (removing predefined words from a language that usually carry no relevant information: and, or, a, an, the, ...),
   c. stemming or lemmatization (the process that converts each word/token into its morphologically neutral form),
   d. n-gram construction (n-grams are terms defined as a concatenation of 1 to n words which appear consecutively in the text),
   e. bag-of-words (BoW) representation, i.e., a vector representation of a document, with value 1 (or word frequency-based weight) for words/terms appearing in the document, and value 0 for the rest of the corpus vocabulary.
2. Calculate the values of heuristics which favor b-terms over other terms.
3. Sort the intersecting terms according to the values of the best performing heuristics and present the top-ranked terms (hopefully representing the b-terms) to the expert during interactive exploration of the two domains.

The development of the best performing heuristics consisted of two phases:

1. Training: we proposed over 40 elementary heuristics, which vary from very simple term-frequency statistics to very elaborate combined measures. We then evaluated their quality on the migraine-magnesium gold standard domain investigated already by Swanson et al. (1988). Results of the evaluation were used to select some of the best performing and most complementary heuristics, which were joined into a new ensemble heuristic. The ensemble heuristic proposed in this paper is generally more accurate and robust than any of the elementary heuristics used in its construction.
2. Testing: we independently evaluated the ensemble heuristic on a second dataset, autism-calcineurin documents investigated by Petrič et al., to confirm its domain independence and its potential for b-term identification.

Note that due to space restrictions, the description of testing the system on the autism-calcineurin domain pair is out of the scope of this paper; the interested reader can find more information in (Juršič et al.).

Elementary Heuristics for b-term Detection

We have proposed over 40 elementary heuristics for b-term evaluation (Juršič et al.), divided into four groups: frequency based, tf-idf based³, similarity based, and outlier based heuristics. Most of these heuristics work in fundamentally the same way: they manipulate the data present in the BoW document vector format to derive a quality measure of the term’s bisociation potential, named the bisociation score. The only exceptions are the outlier based heuristics, which first detect outlier documents and then use the BoW vector information.
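Because the heuristics operate on BoW vectors, the preprocessing steps a to e of the method can be illustrated with a short Python sketch. This is only a sketch under stated assumptions: the stop-word list is a tiny illustrative sample, naive_stem is a crude stand-in for a real stemmer or lemmatizer, and all function names are hypothetical rather than taken from CrossBee.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used by the authors).
STOP_WORDS = {"and", "or", "a", "an", "the", "with", "of", "in"}


def tokenize(text):
    """Step a: split lowercase text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Step c (crude stand-in): strip a trailing plural 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def ngrams(tokens, n_max):
    """Step d: all runs of 1 to n_max consecutive tokens, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bag_of_words(text, n_max=2):
    """Steps a to e: term -> frequency mapping for one document."""
    tokens = [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]  # steps a-c
    return Counter(ngrams(tokens, n_max))  # steps d-e


bow = bag_of_words("Migraine treatment with calcium channel blockers")
print(sorted(bow))
```

Terms absent from a document simply have count 0 in the Counter, which corresponds to the zero entries of the BoW vector.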
Instead of providing the entire list of heuristics whose performance we tested extensively, we only specify the subset which we actually selected to construct the ensemble heuristic. The selected heuristics are defined as follows.

Term to document frequency ratio: is a frequency based heuristic

$$\mathit{freqRatio}(t) = \frac{\mathit{countTerm}_{D_u}(t)}{\mathit{countDoc}_{D_u}(t)}$$

defined as the ratio of the number of occurrences of term t in document set Du (named term frequency in tf-idf related text preprocessing contexts), and the number of documents where term t appears in document set Du (named document frequency in tf-idf related contexts).

Sum of term’s importance in both domains: is a heuristic based on tf-idf metrics

$$\mathit{tfidfDomnSum}(t) = \mathit{tfidf}_{D_1}(t) + \mathit{tfidf}_{D_2}(t)$$

defined as a sum of the tf-idf value of term t in the centroid vector of document set D1 plus the term’s tf-idf value in the centroid vector of document set D2, where the centroid vector is defined as the sum of all document vectors and thus represents an average document of the given document collection.

Sum of term frequencies in three outlier sets: is an outlier based heuristic

$$\mathit{outFreqSum}(t) = \mathit{countTerm}_{D_{CS}}(t) + \mathit{countTerm}_{D_{RF}}(t) + \mathit{countTerm}_{D_{SVM}}(t)$$

which computes the sum of term frequencies in three outlier sets, where the sets of outliers were identified by three classifiers (Sluban et al. 2012): Centroid Similarity (CS) classifier, Random Forest (RF) classifier, and Support Vector Machine (SVM) classifier.

Relative frequencies in outlier sets: focuses on outlier sets

$$\mathit{outFreqRelCS}(t) = \frac{\mathit{countTerm}_{D_{CS}}(t)}{\mathit{countTerm}_{D_u}(t)}$$

Documents in the outlier sets frequently embody new information that is often hard to explain in the context of existing knowledge. We concentrate on specific outliers, namely domain outliers, i.e., documents that tend to be more similar to the documents of the other domain than to those of their own domain.
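To make the count-based definitions above concrete, here is a minimal Python sketch of freqRatio, outFreqSum and outFreqRelCS. It assumes that each document is a list of terms, that Du denotes the union of the documents of both domains (an assumption; Du is not defined explicitly in the text), and that the outlier sets are already given. The function names and the toy data are hypothetical.

```python
def count_term(docs, term):
    """Total number of occurrences of `term` in a document set."""
    return sum(doc.count(term) for doc in docs)


def count_doc(docs, term):
    """Number of documents in the set that contain `term`."""
    return sum(1 for doc in docs if term in doc)


def freq_ratio(all_docs, term):
    """freqRatio: term frequency divided by document frequency in D_u."""
    n_docs = count_doc(all_docs, term)
    return count_term(all_docs, term) / n_docs if n_docs else 0.0


def out_freq_sum(outlier_sets, term):
    """outFreqSum: sum of term frequencies over the given outlier sets."""
    return sum(count_term(docs, term) for docs in outlier_sets.values())


def out_freq_rel(outlier_docs, all_docs, term):
    """outFreqRel*: term frequency in one outlier set relative to D_u."""
    total = count_term(all_docs, term)
    return count_term(outlier_docs, term) / total if total else 0.0


# Toy corpus (assumed D_u) and toy outlier sets keyed by classifier name.
all_docs = [
    ["calcium", "channel", "calcium"],
    ["calcium", "magnesium"],
    ["migraine", "serotonin"],
]
outliers = {"CS": [all_docs[0]], "RF": [all_docs[0], all_docs[1]], "SVM": []}

ratio = freq_ratio(all_docs, "calcium")
out_sum = out_freq_sum(outliers, "calcium")
rel_cs = out_freq_rel(outliers["CS"], all_docs, "calcium")
print(ratio, out_sum, rel_cs)
```

Returning 0.0 for terms that never occur is a design choice of this sketch to avoid division by zero; the source does not say how such cases are handled.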
The procedure that we use to detect outlier documents first builds a classification model for each domain and then classifies all the documents using the trained classifier. The documents that are misclassified are declared outlier documents, since according to the classification model they do not belong to their initial domain.

The other two outlier based heuristics, relative frequency in the RF outlier set (outFreqRelRF) and relative frequency in the SVM outlier set (outFreqRelSVM), are defined in the same way as the outFreqRelCS heuristic.

We have also defined a supplementary baseline heuristic:

$$\mathit{random}(t) = \mathit{randNum}()$$

which serves as a baseline for the others, as it returns a random number from the interval (0,1) regardless of the term under investigation.

³ tf-idf stands for Term Frequency Inverse Document Frequency word weight computation, used in text mining (Feldman and Sanger, 2007).

Evaluation of Elementary Heuristics

To test the proposed heuristics for b-term detection, we have evaluated them on the problem of detecting bisociative links between migraine and magnesium in the respective literatures. To this end, we replicated Swanson’s early migraine-magnesium experiment, which represents a gold standard for literature-based discovery. The evaluation procedure used in this experiment differs from Swanson’s original method and the RaJoLink method in that a human expert was not involved.

Magnesium deficiency has been shown in several studies to cause migraine headaches (e.g., Swanson 1990; Thomas et al. 1992; Thomas et al. 2000; Demirkaya et al. 2001; Trauninger et al. 2002). In the literature-based closed discovery process Swanson managed to find more than 60 pairs of articles connecting the migraine domain with magnesium deficiency via several bridging concepts.
His closer inspection of the literature about migraine and the literature about magnesium showed that 11 pairs of documents, when put together, provided confirmation of the hypothesis that magnesium deficiency may cause migraine headaches (Swanson 1990). Some of the detected bridging terms are shown in Figure 1.

Similar to Swanson’s original study of the migraine literature (Swanson 1988), we used titles as input to our closed discovery process. We performed the experiments on a subset of PubMed titles of articles that were published before 1988 (i.e., before Swanson’s literature-based discovery of the migraine-magnesium relation) and were retrieved with the PubMed search engine. As a result we got 2,425 migraine and 5,633 magnesium titles of PubMed articles. These article titles were preprocessed with standard text mining techniques, resulting in 13,525 distinct terms which were analyzed and scored by the presented elementary heuristics. Each heuristic assigned a score to every term from the list. Afterwards we sorted all 7 lists (6 elementary heuristics and the baseline heuristic) and thus created 7 ranked lists of terms. Among these 13,525 terms, there were also all 43 terms which Swanson (1988) marked as b-terms and which we hoped to propagate to the top of the ranked list using the designed heuristics. The b-terms identified by Swanson, verified by the expert to provide new discoveries in the field, are used as a gold standard in the evaluation in this work.

We compared the heuristics based on their ROC (Receiver Operating Characteristic) curves and AUC (Area Under ROC) analysis. The idea underlying ROC curve construction is the following: go through the ranked list from the beginning and, every time a b-term is seen, draw a line up on the ROC canvas; otherwise draw a line to the right. The ideal curve (when all b-terms are at the very beginning of a ranked list) would go straight up to the top, followed by a straight section to the rightmost part of the graph.
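The curve construction just described can be sketched in a few lines of Python. The sketch assumes a ranked list of terms (best first) and a set of gold standard b-terms; the function names are hypothetical, and the area is normalized by the number of b-terms times the number of other terms.

```python
def roc_points(ranked_terms, b_terms):
    """Walk the ranked list: step up for a b-term, step right otherwise."""
    x, y = 0, 0
    points = [(x, y)]
    for term in ranked_terms:
        if term in b_terms:
            y += 1  # a b-term was found: move up
        else:
            x += 1  # any other term: move right
        points.append((x, y))
    return points


def auc(ranked_terms, b_terms):
    """Normalized area under the step curve; None if an axis has zero length."""
    points = roc_points(ranked_terms, b_terms)
    max_x, max_y = points[-1]
    if max_x == 0 or max_y == 0:
        return None
    # Each step to the right adds a column whose height is the current y.
    area = sum((x1 - x0) * y0 for (x0, y0), (x1, _) in zip(points, points[1:]))
    return area / (max_x * max_y)


gold = {"b1", "b2"}
ideal = auc(["b1", "b2", "x", "y"], gold)
mixed = auc(["b1", "x", "b2", "y"], gold)
worst = auc(["x", "y", "b1", "b2"], gold)
print(ideal, mixed, worst)
```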
Area under the ideal ROC curve is equal to 1 when both scales are normalized.

ROC analysis (see Figure 2) shows the performance of elementary heuristics on the migraine-magnesium gold standard dataset. Details on heuristics evaluation can be found in (Juršič et al. 2012), while the main observations and results are outlined below.

It can be observed that some heuristics are well suited to the purpose of b-term discovery. We are especially satisfied with heuristics which perform well at the start of the ranked list, e.g., heuristic outFreqRelRF places four b-terms among the first 50 terms in its ranked list, while the random approach finds less than one b-term among its first 200 terms. On the other hand, some heuristics do not perform so well at the start of the list, e.g., outFreqSum and tfidfDomnSum do not look promising at first sight. However, we included them in the set of six heuristics on the basis of complementarity, so that they fit together well when used in the ensemble heuristic, providing not only better performance but also greater robustness of the ensemble.

Figure 2. ROC curves representing the performance of elementary heuristics on the learning (migraine-magnesium) dataset.

[Figure 2 plot; axis ranges 0 to 40 and 0 to 1,800; legend as printed: freqRatio (93.35%), tfidfDomnSum (93.85%), outFreqSum (94.96%), outFreqRelRF (95.24%), outFreqRelSVM (95.06%), outFreqRelCS (94.96%), random (50%)]

The Ensemble Heuristic

The ensemble heuristic is a heuristic which combines the results of the selected elementary heuristics (outFreqRelRF, outFreqRelSVM, outFreqRelCS, outFreqSum, tfidfDomnSum, and freqRatio) into an aggregated result. In principle, the ensemble heuristic score is a sum of two parts, the ensemble voting score and the ensemble position score, and is computed as:

$$s_t = s_t^{vote} + s_t^{pos}$$

1. Ensemble voting score ($s_t^{vote}$) of term t is based on the number of times the term appears in the first third of the elementary heuristics’ ranked lists. Each selected base heuristic $h_i$ gives one vote ($s_{t_j,h_i}^{vote} = 1$) to each term which is in the first third of its ranked list of terms and zero votes to all the other terms ($s_{t_j,h_i}^{vote} = 0$). Formally, the ensemble voting score of a term $t_j$ that is at position $p_j$ in the ranked list of $n$ terms is computed as a sum of individual heuristics’ voting scores:

$$s_{t_j}^{vote} = \sum_{i=1}^{k} s_{t_j,h_i}^{vote} = \sum_{i=1}^{k} \begin{cases} 1 & \text{if } p_j < n/3, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, each term can get a score $s_{t_j}^{vote} \in \{0, 1, 2, \ldots, k\}$, where $k$ is the number of base heuristics used in the ensemble.

2. Ensemble position score ($s_t^{pos}$) of term t is based on the average position of the term in the elementary heuristics’ ranked lists. For each heuristic $h_i$, the term’s position score $s_{t_j,h_i}^{pos}$ is calculated as $(n - p_j)/n$, which results in position scores being in the interval [0,1). For an ensemble of $k$ heuristics, the ensemble position score is computed as an average of individual heuristics’ position scores:

$$s_{t_j}^{pos} = \frac{1}{k} \sum_{i=1}^{k} s_{t_j,h_i}^{pos} = \frac{1}{k} \sum_{i=1}^{k} \frac{n - p_j}{n}$$

Using the migraine-magnesium domain pair, we experimentally confirmed, through the ROC curve evaluation of different heuristics in terms of the quality of b-term retrieval, that the ensemble heuristic is the best measure for b-term detection and is able to retrieve b-terms approximately 7 times faster than the random approach. Besides testing on the migraine-magnesium dataset, we also evaluated the ensemble heuristic on an independent autism-calcineurin dataset (Petrič et al.) and confirmed the utility and domain independence of the proposed approach.

The CrossBee System

This section presents our system, which helps experts search for hidden links that connect two seemingly unrelated domains. We designed and implemented an online system named CrossBee (Cross-Context Bisociation Explorer)⁴.
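Since CrossBee is built around the ensemble ranking defined in the previous section, a minimal Python sketch of that score may be useful here. It assumes that every base heuristic has already produced a ranked list (best first) over the same n terms and that positions are 1-based, which is consistent with position scores lying in [0,1). The function name ensemble_scores and the toy rankings are hypothetical; this is not the CrossBee implementation.

```python
def ensemble_scores(rankings):
    """Combine k ranked lists (best first) into one score per term.

    score(t) = number of lists where t is in the first third (votes)
             + average over lists of (n - position) / n (position score).
    """
    k = len(rankings)
    n = len(rankings[0])
    partial = {}  # term -> (votes, sum of position scores)
    for ranked in rankings:
        for position, term in enumerate(ranked, start=1):  # 1-based p_j
            vote = 1 if position < n / 3 else 0
            pos_score = (n - position) / n
            votes, pos_sum = partial.get(term, (0, 0.0))
            partial[term] = (votes + vote, pos_sum + pos_score)
    return {term: votes + pos_sum / k for term, (votes, pos_sum) in partial.items()}


# Two toy heuristics ranking the same six terms.
rankings = [
    ["a", "b", "c", "d", "e", "f"],
    ["b", "a", "c", "d", "e", "f"],
]
scores = ensemble_scores(rankings)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because the voting part is an integer and the position part is always below 1, votes dominate the ordering and the average position only breaks ties between terms with the same number of votes.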
The system was first designed as an online implementation of the ensemble ranking methodology. To the core functionality we have, however, added other functionalities and content presentations which effectively turned CrossBee into a user-friendly tool for ranking and exploration of bisociative terms that have the potential for cross-context link discovery. This enables the user not only to spot but also to efficiently investigate terms that represent potential cross-domain links. Below we describe a typical use case and the extended functionality of the system.

⁴ CrossBee is available at website: http://crossbee.ijs.si/.

A Typical CrossBee Use Case

The most standard use case is the following. The user starts at the system’s home page by inputting two sets of documents of interest and by tuning the parameters of the system. The minimal required user input at this point is a file with the documents from two domains. The prescribed format of the input file is kept simple to enable all users, regardless of their computing skills, to prepare the files. Each line of the file contains exactly three tab-separated entries: (a) document identification number, (b) domain acronym, and (c) the document text. The other options available to the user include specifying the exact preprocessing options, specifying the base heuristics to be used in the ensemble, specifying outlier documents identified by external outlier detection software, defining the already known b-terms, and others. When the user has selected all the desired options he proceeds to the next step.

CrossBee then starts a computationally very intensive step in which it prepares all the data needed for the fast subsequent exploration phase. During this step the actual text preprocessing, base heuristics, ensemble, bisociation scores and rankings are computed in the way presented in the previous section. This step does not require any user intervention.
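The three-column, tab-separated input format described above can be read with a few lines of Python. This is a minimal sketch, not CrossBee’s own loader: the function name read_documents, the domain acronyms MIG and MAG, and the choice to reject malformed lines are all assumptions made for illustration.

```python
import io
from collections import defaultdict


def read_documents(handle):
    """Parse lines of the form: <doc id> TAB <domain acronym> TAB <text>."""
    by_domain = defaultdict(list)
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated entries, got {len(fields)}"
            )
        doc_id, domain, text = fields
        by_domain[domain].append((doc_id, text))
    return dict(by_domain)


# Hypothetical two-line input file, given here as an in-memory string.
sample = io.StringIO(
    "1\tMIG\tMigraine treatment with calcium channel blockers\n"
    "2\tMAG\tMagnesium: nature's physiologic calcium blocker\n"
)
docs = read_documents(sample)
print(sorted(docs))
```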
After computation, the user is presented with a ranked list of b-term candidates. The list provides the user with some additional information, including the votes of the ensemble’s individual base heuristics and the term’s occurrence statistics in both domains. The user then browses through the list and chooses the term(s) he believes to be promising for finding meaningful connections between the two domains. At this point, the user can start inspecting the actual appearances of the selected term in both domains, using the efficient side-by-side document inspection shown in Figure 3. In this way, he can verify whether his rationale behind selecting this term as a bridging term can be justified based on the contents of the inspected documents.

The most important result of the exploration procedure is evidence that a chosen term is an actual bridge between the two domains, based on supporting facts from the documents. As experienced in sessions with the experts, the identified documents are an important result as well, as they usually turn out to be a valuable source of information providing a deeper insight into the discovered terms which indicate new cross-domain relations.

Extended CrossBee Functionality

Below we list the implemented functionalities of the CrossBee system.

• Document focused exploration empowers the user to filter and order the documents by various criteria. The user may prefer to start exploring the domains by reading documents rather than browsing through the term lists. The ensemble ranking can be used to propose which documents to read by suggesting those with the highest proportion of highly ranked terms.
• Detailed document view provides a more detailed presentation of a single document, including various term statistics and a similarity graph showing the similarity between this document and other documents from the dataset.
• Methodology performance analysis supports the evaluation of the methodology by providing various data which can be used to measure the quality of the results, e.g., data for plotting the ROC curves.
• High-ranked term emphasis marks the terms according to their bisociation score calculated by the ensemble heuristic. When using this feature, all high-ranked terms are emphasized throughout the whole application, making them easier to spot.
• b-term emphasis marks the terms defined as b-terms by the user.
• Domain separation is a simple but effective option which colors all the documents from the same domain with the same color, making an obvious distinction between the documents from the two domains.
• UI customization enables the user to decrease or increase the intensity of the following features: high-ranked term emphasis, b-term emphasis and domain separation. In cooperation with the experts, we discovered that some of them like the emphasizing features while others do not. Therefore, we introduced the UI customization, where everybody can set the intensity of these features according to their preferences.

Figure 3. Illustration of the side-by-side document inspection of potential cross-domain links functionality of CrossBee, using an example from the migraine-magnesium dataset analysis, focusing on the analysis of the paroxysmal term.

Discussion and Further Work

Current literature-based approaches depend strictly on simple, associative information search. Commonly, a literature-based association is computed using measures of similarity or co-occurrence. Because of their ‘hard-wired’ underlying criteria of co-occurrence or similarity, association-based methods often fail to discover relevant information which is not related in obvious associative ways. Information related across separate domains is especially hard to identify with the conventional associative approaches.
In such cases the domain-crossing connections, called bisociations (Berthold, 2012), can help generate creative and innovative discoveries. Swanson (1986), Weeber et al. (2001), Petrič et al. (2009) and several other authors have investigated means of finding novel, interesting connections between disparate research findings which can be extracted from the published literature. They have shown that the analysis of implicit cross-context associations hidden in scientific literature can guide hypothesis formulation and lead to the discovery of new knowledge.

The methodology presented in this paper has the potential for improved computational creativity in supporting the expert in the task of cross-domain literature mining. The main novelty is an approach to ensemble-based bridging term ranking. The creative act of finding bridging terms is supported by the user-friendly CrossBee system for literature mining, implementing closed cross-domain link discovery. It has the potential to identify bridging concepts in the intersection of different domain literatures, as confirmed in the experiments in mining the literature on migraine and magnesium.

In further work we will apply the CrossBee system to new domain pairs, focusing on the system's potential to lead to new scientific discoveries. In addition to linking to PubMed, we will also explore ways to connect CrossBee to other document sources, including keyword search over documents on the web. Moreover, it would be interesting to explore the potential of CrossBee in media research, as well as in linguistics, where metaphors could potentially be discovered by cross-context text mining. One of the priorities of our work will be, however, to use CrossBee in collaboration with the experts from different fields (e.g.
physicists and biologists) to address real-life domain problems and to get valuable feedback from these targeted users.

References

Berthold, M., ed. 2012. Bisociative Knowledge Discovery. Springer (in press).
Boden, M. 1992. The Creative Mind. London: Abacus.
Demirkaya, S.; Vural, O.; Dora, B.; and Topcuoglu, M.A. 2001. Efficacy of intravenous magnesium sulfate in the treatment of acute migraine attacks. Headache 41(2): 171-177.
Feldman, R., and Sanger, J. 2007. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Juršič, M.; Sluban, B.; Cestnik, B.; Grčar, M.; and Lavrač, N. 2012. Bridging concept identification for constructing information networks from text documents. In: Berthold, M.R., ed., Bisociative Knowledge Discovery. Springer LNAI 7250 (in press).
Koestler, A. 1964. The Act of Creation. New York: MacMillan.
Mednick, S.A. 1962. The associative basis of the creative process. Psychol. Rev. 69: 220-232.
Petrič, I.; Urbančič, T.; Cestnik, B.; and Macedoni-Lukšič, M. 2009. Literature mining method RaJoLink for uncovering relations between biomedical concepts. J. Biomed. Inform. 42(2): 219-227.
Roberts, R.M. 1989. Serendipity: Accidental Discoveries in Science. Wiley.
Sluban, B.; Juršič, M.; Cestnik, B.; and Lavrač, N. 2012. Exploring the power of outliers for cross-domain literature mining. In: Berthold, M.R., ed., Bisociative Knowledge Discovery. Springer LNAI 7250 (in press).
Smalheiser, N.R., and Swanson, D.R. 1998. Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses. Comput. Methods Programs Biomed. 57(3): 149-153.
Swanson, D.R. 1986. Undiscovered public knowledge. Library Quarterly 56(2): 103-118.
Swanson, D.R. 1988. Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine 31(4): 526-557.
Swanson, D.R. 1990. Medical literature as a potential source of new knowledge. Bull. Med. Libr. Assoc. 78(1): 29-37.
Swanson, D.R.; Smalheiser, N.R.; and Torvik, V.I. 2006. Ranking indirect connections in literature-based discovery: The role of Medical Subject Headings (MeSH). J. Am. Soc. Inf. Sci. Tec. 57(11): 1427-1439.
Thomas, J.; Millot, J.M.; Sebille, S.; Delabroise, A.M.; Thomas, E.; Manfait, M.; and Arnaud, M.J. 2000. Free and total magnesium in lymphocytes of migraine patients - effect of magnesium-rich mineral water intake. Clin. Chim. Acta 295(1-2): 63-75.
Thomas, J.; Thomas, E.; and Tomb, E. 1992. Serum and erythrocyte magnesium concentrations and migraine. Magnes. Res. 5(2): 127-130.
Trauninger, A.; Pfund, Z.; Koszegi, T.; and Czopf, J. 2002. Oral magnesium load test in patients with migraine. Headache 42(2): 114-119.
Urbančič, T.; Petrič, I.; Cestnik, B.; and Macedoni-Lukšič, M. 2007. Literature mining: towards better understanding of autism. In: Bellazzi, R.; Abu-Hanna, A.; and Hunter, J., eds., Proceedings of the 11th Conference on Artificial Intelligence in Medicine in Europe, 217-226. Springer.
Urbančič, T.; Petrič, I.; and Cestnik, B. 2009. RaJoLink: A method for finding seeds of future discoveries in nowadays literature. In: Rauch, J., ed., Foundations of Intelligent Systems. LNAI 5722, 129-138. Springer.
Weeber, M.; Vos, R.; Klein, H.; and de Jong-van den Berg, L.T.W. 2001. Using concepts in literature-based discovery: Simulating Swanson's Raynaud–fish oil and migraine–magnesium discoveries. J. Am. Soc. Inf. Sci. Tech. 52(7): 548-557.
Wiggins, G.A. 2006. A Preliminary Framework for Description, Analysis and Comparison of Creative Systems. Journal of Knowledge Based Systems 19(7): 449-458.