Defining Creativity: Finding Keywords for Creativity Using Corpus Linguistics Techniques

Anna Jordanous

Creative Systems Lab / Music Informatics Research Centre, School of Informatics, University of Sussex, UK
a.k.jordanous at sussex.ac.uk

Abstract. A computational system that evaluates creativity needs guidance on what creativity actually is. It is by no means straightforward to provide a computer with a formal definition of creativity; no such definition yet exists and viewpoints in creativity literature vary as to what the key components of creativity are considered to be. This work combines several viewpoints for a more general consensus of how we define creativity, using a corpus linguistics approach. 30 academic papers from various academic disciplines were analysed to extract the most frequently used words and their frequencies in the papers. This data was statistically compared with general word usage in written English. The results form a list of words that are significantly more likely to appear when talking about creativity in academic texts. Such words can be considered keywords for creativity, guiding us in uncovering key sub-components of creativity which can be used for computational assessment of creativity.

1 Introduction

How can a computational system perform autonomous evaluation of creativity? A seemingly simple way is to give the system a definition of creativity which it can use to test whether creativity is present, and to what extent [1, 9, 11]. There have been many attempts to capture the nature of creativity in words (Appendix A lists 30 such papers), but there is currently no accepted consensus and many viewpoints exist which may prioritise different aspects of creativity (this is discussed further in Section 2.1). Identifying what contributes to our intuitive understanding of creativity can guide us towards a more formal definition of the general concept of creativity.

If a word is used significantly more often than expected to discuss creativity, then I suggest it is associated with the meaning of creativity. Many such words may be more tightly defined than creativity itself; we can encode these definitions in computational tests and combine these tests to approximate a measurement of creativity. The intention of this approach is to make the goal of automated creativity assessment more manageable by reducing creativity to a set of more tractable sub-components, each of which is considered a key contributory factor towards creativity, recognised across a combination of different viewpoints.

2 Finding Keywords For Creativity

The aim of this work is to find words which are significantly more likely to be used in discussions of creativity across several disciplines. These words can be treated as keywords that highlight key components of creativity.

What discussions of creativity should be examined? Written text is simpler to analyse than speech and there are many sources to choose from. The texts should be of a reasonable length, otherwise they provide only an overview rather than investigating more subtle points which may be significant. This study concentrates on the academic literature discussing creativity, in order to reduce variability in formats, facilitate discovery of key documents for inclusion and allow a measure of the influence of the document (the number of citations).

To find words used specifically in creativity literature, the language used in several papers was analysed to extract the frequencies with which individual words were used. These extracted word frequencies were statistically compared with data on how the English language is used in general written form.

2.1 Creativity Corpus: A Selection of Papers on Creativity

The academic literature on the nature of creativity ranges over at least the past 60 years, arguably starting from Guilford's seminal 1950 presentation on what creativity is and how to detect it.
Many repeated themes have emerged in the literature as important components of creativity. As an example, the word clouds¹ in Figs. 1 and 2 show that the word new is frequently used in definitions of creativity and also in discussions of what creativity is.

Fig. 1. Most frequent words in 23 creativity definitions (excluding common-use words)

Wide variance can be found, though, in what are considered primary contributory factors of creativity. For example, psychometric tests for creativity (such as [12]) focus on problem solving and divergent thinking, rewarding the ability to move away from standard solutions to a problem. In contrast, much recent writing in computational creativity (such as [9, 11]) places emphasis on novelty and value as key attributes. Whilst there is some crossover, the differing emphases give a subtly different interpretation of creativity across academic fields.

Fig. 2. Most frequently used words in 30 academic papers on creativity (excluding common English words).

Fig. 3. With creativity and creative removed (as they dominate the image)

This study considers 30 papers on the nature of creativity, written from a number of different perspectives. This set of papers is referred to in this paper as the creativity corpus² and is detailed in Appendix A. The 30 papers were selected using criteria such as the paper's influence over future work (particularly measured by number of citations), the year of publication, academic discipline and author(s). To match the diversity of opinions in creativity literature as closely as possible, the set of papers gives viewpoints from many different authors, from psychology to computer science backgrounds and across time, from 1950 to the current year (2009). Figure 4 shows the distribution of papers by subject, according to journal classification in the academic database Scopus³.

Fig. 4. Distribution of subject area of papers over time

The methodology for this study placed some limitations on what papers could be used. Papers had to be written in English⁴ and had to be available in a format that plain text could be extracted from (this excluded books or book chapters).

¹ Generated using software at http://www.wordle.net
² A corpus is the set of all related data being analysed (plural: corpora).
³ Scopus classifies some journals under more than one subject area.
⁴ All non-British word spellings were amended to British spellings before analysis.

2.2 Data Preparation

For each paper a plain text file was generated, containing the full text of that paper. All journal headers and copyright notices were removed from each paper, as were the author names and affiliations, list of references and acknowledgements. All files were also checked for any non-ASCII characters and anomalies that may have arisen during the creation of the text file.

2.3 Extraction of Word Frequencies from Data

R is a statistical programming environment⁵ that is useful for corpus linguistics analysis. Using R, a word frequency table was constructed from the 30 text files containing the creativity corpus. For each word⁶ in the text files, the frequency table listed how many papers that word is used in and the number of times the word is used in the whole creativity corpus (all papers combined).

2.4 Post Processing of Results

To reduce the size of the frequency table and focus on more important words, all hapaxes (words which appear only once in the whole creativity corpus) were removed. Any strings of numbers returned as words in the frequency table were also removed. To filter out words that were not used by many authors, any words which appear in fewer than 5 of the 30 papers were also discarded.
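
As an illustration of Sections 2.3 and 2.4, the following minimal Python sketch builds a frequency table of this kind and applies the filters described above. It is not the R code used in this study: the function names, the lower-casing step and the letters-only tokeniser are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, so "problem-solving" yields two
# tokens and strings of digits never become tokens at all.
TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lower-case the text (an assumption) and return its letter-only tokens."""
    return TOKEN.findall(text.lower())


def frequency_table(papers):
    """Map each word to (count in whole corpus, number of papers using it)."""
    total = Counter()
    papers_with_word = Counter()
    for text in papers:
        tokens = tokenize(text)
        total.update(tokens)                   # every occurrence counts
        papers_with_word.update(set(tokens))   # each paper counts once per word
    return {word: (total[word], papers_with_word[word]) for word in total}


def filter_table(table, min_papers=5):
    """Drop hapaxes and words used in fewer than `min_papers` papers."""
    return {
        word: (count, n_papers)
        for word, (count, n_papers) in table.items()
        if count > 1 and n_papers >= min_papers
    }
```

Here `papers` is a hypothetical list of strings, one per prepared text file; calling `filter_table(frequency_table(papers))` with the default `min_papers=5` mirrors the threshold of 5 out of 30 papers.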

2.5 Analysis of Results

It is not enough to consider purely the word frequencies on their own. A distinction is often made in linguistics [3, 6, 10] between very commonly used words (form or closed class words) and lower frequency words (content or open class words): when used more often than usual in a text, the open class words usually hold the most interesting or specific content [3]. So for this study the most common words overall are not necessarily the most useful; as the results in Table 1 show, the most frequent words overall are usually those expected to be prolific in any written texts.

Removing stopwords (very commonly used English words such as "the" or "and") is not sufficient for the purposes of this work: this study focusses on those words which are specifically used more often than expected when discussing creativity, as opposed to other texts. A method for quantifying this usage is discussed in the remainder of this section.

⁵ http://www.r-project.org/
⁶ A word is defined as a string of letters delimited by spaces or punctuation. A compound term such as "problem-solving" was divided into "problem" and "solving".

Data on General Language Use: British National Corpus (BNC). The BNC is a collection of texts and transcriptions of speech, from a variety of sources of British English usage. The corpus comprises approximately 100 million words, of which around 89 million words are from written sources and the remainder from transcriptions of speech. This study only uses data on the written sources, excluding all transcriptions of speech, as the creativity corpus is also solely from written sources.

The data used in this study was taken from [7]: relative word frequency data from a sample subset of the written part of the BNC. Before using this data, frequencies were extrapolated to estimate absolute values.

Statistical Testing of Word Frequencies.
It was expected that there is a relationship between how many times a word is used in the creativity corpus and how many times it is used in general writing: to use statistical terminology, that the two corpora are correlated. As the data in both corpora is ratio-scored (i.e. the data is measured on a quantifiable scale), a Pearson correlation test can be performed on the word frequency counts for each corpus, to test the hypothesis that there is significant positive correlation.

If there is significant evidence of correlation, then the words which do not follow the general trend of correlation are of most interest: specifically, the words that are used more frequently in the creativity corpus than would be expected given the frequency with which they appear in the BNC. A common way to measure this is to use the log likelihood ratio statistic $G^2$ [3, 6, 8, 10]⁷:

\[ G^2 = 2 \sum o_{ij} \left( \ln o_{ij} - \ln e_{ij} \right) \tag{1} \]

where $o_{ij}$ is the actual observed number of occurrences of a word $i$ in corpus $j$, and $e_{ij}$ is the expected number of occurrences of a word $i$ in corpus $j$ (see Eqn. 2):

\[ e_{ij} = \frac{(o_{ij} + o_{ik}) \times \mathrm{total}(j)}{\mathrm{total}(j) + \mathrm{total}(k)} \tag{2} \]

where $\mathrm{total}(j)$ is the total number of words in corpus $j$ and $k$ denotes the other corpus.

The $G^2$ value is a measure of how well data in one corpus fits a model distribution based on both corpora. The higher the $G^2$ value, the more that word usage deviates from what is expected given this model.

$G^2$ measures the extent to which a word deviates from the model but does not indicate in which corpus it appears more frequently than expected. Therefore a subset of the results was discarded: only those words which appear more frequently than expected in the creativity corpus were retained.

⁷ An alternative to $G^2$ is the chi-squared test ($\chi^2$): see [3, 5, 6, 8, 10] for discussion of why $G^2$ is the more appropriate option for very large corpora.

3 Results

3.1 Raw Frequency Counts

As can be seen in Table 1 and as discussed in Section 2.5, most words which appeared very frequently were common English words, not useful for this study.

Table 1. Most frequently used words in the creativity corpus.

Word   Count in corpus    Word      Count in corpus    Word        Count in corpus
of     8052               is        2412               as          1448
and    4988               that      2372               creativity  1433
to     4420               creative  1994               are         1294
in     3939               for       1716               this        1174
a      3647               be        1561               with        1116

Figure 2 shows the results with "common English words" removed (according to http://www.wordle.com); however, as discussed in Section 2.5, this study's focus is on how words are used in the creativity corpus compared to normal usage, so removing only wordle.com's stopwords is not sufficient for our purposes.

3.2 Using the BNC data

As expected, the creativity corpus and BNC word frequencies are significantly positively correlated, at a 99% level of confidence (p<0.01). Pearson correlation testing returned a value of +0.716.

The analysis returned 781 words which are significantly more likely to appear in creativity literature than in written English in general (at a 99% level of confidence). Table 2 shows the 100 words with the highest $G^2$ score.

4 Discussion of Findings

This work has generated a list of words which are significantly associated with academic discussions of what creativity is. The list is ordered by how likely these words are to appear in creativity literature, so the higher they are on the list, the more significantly they are associated with such discussions.

While words such as divergent and originality appear high on the list, as expected, some results are more surprising at first glance: for example, openness is 6th and empirical is 21st.

One notable observation is that process, in 9th position with a $G^2$ value of 1986.72, is a good deal higher than product, in 409th place with a $G^2$ value of 75.38.⁸
Although closer inspection shows that the word process has been used in a wider range of contexts than product, there are still surprisingly many discussions about the processes involved in creativity. This result provides intriguing evidence for the product vs. process debate in creativity assessment [1, 9, 11].

⁸ Both $G^2$ values are still well above 6.63, the critical value for significance at p<0.01.

Table 2. Top 100 words in creativity corpus, sorted by descending signed $G^2$

Some words appear surprisingly high in Table 2, due to unexpectedly low frequencies being recorded in the BNC data. Two examples are because and found. This suggests two possibilities: either a slight weakness in the representativeness of the sample BNC data from [7] (perhaps understandable given the sheer quantity of data in the BNC; no sample can be 100% representative of a larger set of data), or alternatively these words may be used more in academic writing than in everyday speech (see Section 4.1 for further discussion of this). From inspection, however, such words seem relatively infrequent compared to the large number of words which are recognisably associated with creativity in at least some academic domains.

4.1 Further Exploration of Keywords

Words in Common Academic Usage. It is possible that some words feature highly in the results solely because they are common academic words. Therefore the results list should be compared to common academic words to see if there is evidence of correlation between the two sets of data. If so, this should also be taken into account.

Two lists of common words in academic English were found: the Academic Word List (AWL) [2] and the University Word List (UWL) [13]. Both contain groups of words, in order of frequency of usage specifically in academic documents (group 1 holds the most frequent words).
Unlike the BNC, the AWL and UWL only provide summary information on academic word usage, with no actual frequency data per word; this limits what statistical testing can be performed.

Spearman correlation testing returned a correlation of -0.236 between the creativity corpus and the AWL, and -0.210 between the creativity corpus and the UWL. Neither correlation value is significant at p<0.01 (or p<0.05). As this indicates no significant relationship between the creativity corpus and either academic list, no correction should be made to the keyword results on account of either set of academic data. Poor availability of any other data on academic word usage hinders further investigation of this issue at present.

Context and Semantics. Although the list of keywords holds much of interest in uncovering what is key to creativity, it relies purely on frequency of word usage. The results are not intended to account for the different contexts in which words are used; when analysing large corpora, exploring every word's semantic context would be highly time-consuming. Instead, the frequency results highlight keywords to focus on in the texts and examine in more detail [6, 10].

Categorising the keywords by semantics is non-trivial and "labour-intensive" [4]. Carrying this out empirically would be a significant step in itself and is a fruitful avenue for further work. From inspection of the contexts in which keywords are used, some key categories are suggested in Table 3.

5 Conclusions

For a computational system to perform automated assessment of creativity, it needs some point of reference on what creativity is. There is no accepted consensus on the exact definition of creativity. This work empirically derives a set of keywords that combine a variety of viewpoints from different perspectives, for a more universal encapsulation of creativity.
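
The keyword test of Section 2.5 can be sketched in a few lines of Python. This is an illustrative reconstruction of Eqns. 1 and 2, not the R code used in the study. It assumes that the sum in Eqn. 1 runs over the two corpora being compared and that a zero count contributes nothing to the sum; the function names are hypothetical. The threshold 6.63 is the critical value for p<0.01 quoted in Section 4.

```python
import math

# Critical value for significance at p < 0.01, as quoted in the paper.
CRITICAL_G2 = 6.63


def expected(o_ij, o_ik, total_j, total_k):
    """Eqn. 2: expected count of a word in corpus j, given both corpora."""
    return (o_ij + o_ik) * total_j / (total_j + total_k)


def g2(o_a, o_b, total_a, total_b):
    """Eqn. 1 for one word, summed over corpora a and b (an assumption)."""
    pairs = (
        (o_a, expected(o_a, o_b, total_a, total_b)),
        (o_b, expected(o_b, o_a, total_b, total_a)),
    )
    score = 0.0
    for observed, exp in pairs:
        if observed > 0:  # assumption: a zero count adds nothing to the sum
            score += observed * (math.log(observed) - math.log(exp))
    return 2.0 * score


def is_keyword(o_creativity, o_reference, total_creativity, total_reference):
    """True if the word is significantly over-used in the creativity corpus."""
    over_used = o_creativity > expected(
        o_creativity, o_reference, total_creativity, total_reference
    )
    score = g2(o_creativity, o_reference, total_creativity, total_reference)
    return over_used and score > CRITICAL_G2
```

Because the statistic is symmetric in the two corpora, the separate over-use check in `is_keyword` plays the role of discarding words that deviate from the model only because they are more frequent in the reference corpus.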

Keywords were calculated through corpus analysis of 30 academic papers on the nature of creativity. The likelihood measure $G^2$ (Eqn. 1) was used to compare word frequencies in the creativity papers against usage of those words in general written English, as represented by the sub-corpus of the BNC containing written texts (see Section 2.5). This analysis returned 781 words which were statistically more common in the creativity literature sample than expected, given their general usage in written English. Table 2 displays the top 100 results.

The list of keywords encapsulates words we commonly use to describe and analyse creativity in academia. Given their strong association with creativity, they point us towards sub-components of creativity that contribute to our intuitive understanding of what creativity is.

Table 3. Key categories for creativity, generated through examining the keywords

Category                  Keywords representing this category
cognitive processes       thinking, primary, conceptual, cognition, perceptual
originality               innovation, originality, novelty
the creative individual   personality, motivation, traits, individual, intrinsic, self
ability                   solving, intelligence, facilitate, fluency, knowledge, IQ
influences                influences, problem, extrinsic, example, interactions, domain
divergence                divergent, investigations, fluency, ideas, research, discovery
autonomy                  unconscious
discovery                 openness, awareness, search, discovery, fluency, research
dimensions                dimensions, attributes, factors, criterion
association               associative, correlation, related, combinations, semantic
product                   artefacts, artistic, elements, verbal
value                     motivation, artistic, solving, positive, validation, retention
study of creativity       empirical, predictions, tests, hypothesis, validation, research
measures of creativity    scores, scales, empirical, ratings, criterion, measures, tests
evolution of creativity   developmental, primary, evolutionary, primitive, basis
replicating creativity    programs, computational, process, heuristics

Many of the keywords in the results can be tested for by a computer more easily than testing for creativity itself. For example:

- Originality: comparing products to other examples in that domain or to a prototype, to measure similarity
- Ability: depending on the domain, there are usually many standardised tests to measure competence in that domain
- Divergence: measuring variance of products against each other
- Autonomy: quantifying the assistance needed during the creative process
- Value: again this is domain dependent, and there will usually be many tests for value measurement in a particular domain

The results presented in this paper identify key components of creativity through a combination of several viewpoints. These results will be used to guide experiments implementing a computational system that evaluates creativity by testing for the key categories that have been identified. The experiments will enable us to determine whether this approach to defining creativity gives a good enough approximation for creativity evaluation and, if so, which combination of tests most closely replicates human assessment of creativity.

Acknowledgements

Nick Collins, Sandra Deshors, Clare Jonas and Luisa Natali all made useful comments during discussions of this work.

References

[1] S. Colton. Creativity versus the perception of creativity in computational systems. In Proc. of AAAI Symposium on Creative Systems, pages 14–20, 2008.
[2] A. Coxhead. A new academic word list. TESOL Quarterly, 34(2):213–238, 2000.
[3] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.
[4] D. Glynn. Polysemy, syntax, and variation. In V. Evans and S. Pourcel, editors, New Directions in Cognitive Linguistics. John Benjamins, Amsterdam, 2009.
[5] S. T. Gries. Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory, 1(2):277–294, 2005.
[6] A. Kilgarriff. Comparing corpora. International Journal of Corpus Linguistics, 6(1):97–133, 2001.
[7] G. Leech, P. Rayson, and A. Wilson. Word Frequencies in Written and Spoken English. Longman, London, UK, 2001.
[8] M. P. Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, Edinburgh, UK, 1998.
[9] A. Pease, D. Winterstein, and S. Colton. Evaluating machine creativity. In Proc. of ICCBR Workshop on Approaches to Creativity, 2001.
[10] P. Rayson and R. Garside. Comparing corpora using frequency profiling. In Proc. of ACL Workshop on Comparing Corpora, 2000.
[11] G. Ritchie. Some empirical criteria for attributing creativity to a computer program. Minds and Machines, 17:67–99, 2007.
[12] E. P. Torrance. Torrance Tests of Creative Thinking. Scholastic Testing Service, Bensenville, IL, 1974.
[13] G. Xue and I. S. P. Nation. A university word list. Language Learning and Communication, 3(2):215–229, 1984.

Appendix A: Papers in the Creativity Corpus